OCCASIONAL PUBLICATIONS IN ACADEMIC COMPUTING
Number 7
DOCUMENT PREPARATION AIDS FOR NON-MAJOR LANGUAGES
by
Andy Black, David Weber,
Fred Kuhl, and Kathy Kuhl
Summer Institute of Linguistics, Inc.
Dallas, TX
1987
Occasional Publications in Academic Computing is devoted to
publishing computer software and documentation deemed to be of
potential usefulness to members of the Summer Institute of
Linguistics for carrying out their field projects in linguistics,
literacy, anthropology, and translation. The software published in
the series may represent work in progress. In publishing this
software, the Summer Institute of Linguistics, Inc. is making no
commitment to maintenance, but is committed to making full
disclosure of source code in cases where maintenance requests cannot
be serviced.
EDITOR: Gary F. Simons
ASSISTANT EDITOR: Linda L. Simons
This manual documents the WRDCHG, SYLCHK, SYLCOR, SPLCOR, HYPHEN,
and DELIM programs. These programs are written in the C programming
language for on-the-field application using personal computers or
small time-sharing systems. They run under the RT-11, MS-DOS
(including Sharp PC5000), TSX, and UNIX operating systems.
Copyright (c) 1987, by Summer Institute of Linguistics, Inc.
Editorial correspondence or program bugs should be addressed to:
Academic Computing
Summer Institute of Linguistics
7500 West Camp Wisdom Road
Dallas, TX 75236
Requests for further copies, standing orders, or accompanying
software diskettes should be addressed to:
Bookstore
Summer Institute of Linguistics
7500 West Camp Wisdom Road
Dallas, TX 75236
CONTENTS
1. INTRODUCTION
   1.1 Overview of program functions
   1.2 Overview of program structure
   1.3 Some lessons from history
2. WORD CHANGE (WRDCHG)
   2.1 Introduction
   2.2 Making a change table
   2.3 The default mode
   2.4 Making a standard format marker field file
   2.5 Running the program
3. SYLLABLE-BASED SPELLING CHECKING (SYLCHK)
   3.1 Introduction
   3.2 Running the program
   3.3 The form of the output
   3.4 How to write the ONC file
   3.5 How to write an orthography change table
4. SYLLABLE-BASED SPELLING CORRECTION (SYLCOR)
   4.1 Introduction
   4.2 Initiating a session with SYLCOR
   4.3 Screen layout
   4.4 Handling possible errors: word edit mode
   4.5 Making the auto-correction and exceptions files
   4.6 Ending a session with SYLCOR
   4.7 Writing your own auto-correction and exception files
5. SPELLING CORRECTION WITH TABLE LOOKUP (SPLCOR)
6. HYPHENATION (HYPHEN)
   6.1 Introduction
   6.2 Data files
   6.3 Running the program
   6.4 Examples
   6.5 Miscellaneous
7. DELIMITER CHECKING AND NESTING CHECK (DELIM)
   7.1 Introduction
   7.2 Running the program
   7.3 The form of the output
   7.4 How to write a delimiter file
   7.5 Program limitations
DOCUMENT PREPARATION AIDS 4
1. INTRODUCTION
The programs described in this booklet are aids to producing
documents. They are useful for a wide range of languages. Each
arose in response to a need felt by field linguists involved in
producing documents in non-English languages.
1.1 Overview of program functions
WRDCHG makes changes to the words of a text, while preserving
capitalization, punctuation and formatting. It is useful for
correcting spelling and typographic errors, or even for adapting
text between closely related dialects. It is simple to use, it
allows for conditioning in terms of word boundaries, and it is
efficient when hundreds of changes are involved because it stores
the changes in a dense form and because it is fast.
SYLCHK identifies potential spelling errors in text, using
decomposition into syllables as the method for identifying possible
errors, and returns these as a list. The user supplies information
about the syllable structure of the language.
SYLCHK and WRDCHG work together to correct many spelling
errors. SYLCHK is first run on the text to collect potential
errors. This list is then (optionally) sorted and duplicates are
eliminated, and then it is edited to make a list of changes. These
changes are then made to the text with WRDCHG.
However, this method has a weakness: without context, the user
may not know how to correct some errors. For example, if the error
were "ther", one would not know whether it should be corrected to
"the", "their", "there", "other", or something else. This sort of
case motivated the next program.
SYLCOR is an interactive editor for correcting potential
errors. SYLCOR identifies potential errors by the syllable
decomposition algorithm used in SYLCHK, using the same data files as
SYLCHK. When a potential error is found, it is displayed in the
upper portion of the screen with the surrounding text and in a work
area in the lower portion of the screen, where it can be corrected.
If the word is modified, the user may make the change an automatic
correction. If it is not modified, the user may add it to one of
various lists of exceptions (for example, names, loan words,
acronyms, and so on).
SPLCOR is like SYLCOR except that, rather than using syllable
decomposition for detecting errors, it assumes that a word is an
error unless it is found on one of the exceptions lists. This may
be a useful approach for languages where the writing system or the
phonology (or both!) make syllable decomposition ineffectual as an
error detection algorithm. The user simply accumulates a list of
all words which are to be passed without further attention.
This brings up an interesting question: What are some other
useful error detection methods? Syllable decomposition has proven
to be useful in many languages, particularly where syllables are
fairly restricted and the writing system represents the phonology
closely. But it will not yield the same results for every language;
for example, it is less effective for Spanish than it is for
Quechua.
Another possibility is morphological parsing, i.e.
decomposition into morphemes rather than into syllables. For
Quechua, morphological parsing is a more effective method than
syllable decomposition, but it is also more costly in terms of the
complexity of the program, the data which must be provided by the
user, and the data which must be loaded each time the program is
run.
There are other schemes that have been used for other languages.
One algorithm for English passes a three-character window over the
word, looking up the probability for the occurrence of each
character triple in a table. (These probabilities are established by
running the program in a training mode on large portions of correct
text.) The word is rejected or passed as a function of the
probabilities of its character triples.
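The three-character window idea can be sketched as follows. This is only an illustration, not the algorithm of any program in this package: the function names, the "#" padding at the word edges, and the rule "reject if any triple was never seen in training" are all assumptions made for the example.

```python
from collections import Counter

def trigrams(word):
    """Character triples of a word; '#' marks the word edges (an assumption)."""
    padded = "#" + word.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(correct_words):
    """Training mode: estimate the probability of each triple from correct text."""
    counts = Counter()
    for word in correct_words:
        counts.update(trigrams(word))
    total = sum(counts.values())
    return {triple: count / total for triple, count in counts.items()}

def is_suspect(word, probabilities, threshold=0.0):
    """Reject a word if any of its triples is at or below the threshold."""
    return any(probabilities.get(t, 0.0) <= threshold for t in trigrams(word))

probabilities = train(["the", "there", "their", "other", "then"])
print(is_suspect("teh", probabilities))   # triples such as "teh" were never seen
print(is_suspect("ther", probabilities))  # every triple occurs in the training words
```

Note that a word like "ther" passes this toy check, because each of its triples occurs in some correct word; a real table would be trained on far more text and would probably use a less crude decision rule.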
I leave the following question with the reader: for the
language to which you wish to apply spelling error detection, what
would be the best method of detecting possible errors? If you come
up with a new idea, perhaps we can prepare alternative programs
which are like SYLCOR and SPLCOR, but which have different error
detection algorithms. SPLCOR provides the skeleton into which other
algorithms for error detection -- ones that you devise -- could be
inserted; the program source code is available for those who wish to
give it a try.
HYPHEN introduces a user-determined character at syllable
boundaries. This can be used as a "discretionary hyphen" for
formatting with a program like Manuscripter. The user provides data
in terms of which the program recognizes syllable boundaries. The
user can control how close to the word boundaries the discretionary
hyphen may occur, so as to avoid stranding parts of words which are
too small.
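As a rough sketch of the idea, suppose the syllables of a word have already been found. The function below joins them with a discretionary hyphen character, skipping any boundary that would strand too few characters at either edge of the word. The function name, its parameters, and the sample syllabification are hypothetical; HYPHEN's own data files and prompts are described in section 6.

```python
def insert_discretionary_hyphens(syllables, mark="-", min_left=2, min_right=2):
    """Join syllables with `mark`, keeping at least min_left characters
    before the first mark and min_right characters after the last one."""
    word_length = sum(len(s) for s in syllables)
    pieces = []
    position = 0  # number of characters emitted so far
    for index, syllable in enumerate(syllables):
        pieces.append(syllable)
        position += len(syllable)
        is_last = index == len(syllables) - 1
        far_enough_in = position >= min_left
        far_enough_from_end = word_length - position >= min_right
        if not is_last and far_enough_in and far_enough_from_end:
            pieces.append(mark)
    return "".join(pieces)

syllables = ["na", "ka", "nan", "paq", "na"]  # assumed syllabification
print(insert_discretionary_hyphens(syllables))                            # na-ka-nan-paq-na
print(insert_discretionary_hyphens(syllables, min_left=3, min_right=3))   # naka-nan-paqna
```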
DELIM checks text to see that delimiters (characters like quote
marks, brackets, braces, parentheses, and so on) are paired and
properly nested. This is useful for technical papers and for
computer programs, both of which often contain a great many
delimiters. The user has control over what DELIM regards as an
opening delimiter character and what is the corresponding closing
delimiter. DELIM reports errors by giving the line number, the
line, and indicating the offending delimiter.
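The kind of check DELIM performs can be illustrated with a stack of open delimiters. This is a minimal sketch under stated assumptions, not DELIM's actual code: the function name, the dictionary of pairs, and the wording of the messages are invented for the example.

```python
def check_delimiters(lines, pairs):
    """Report unpaired or badly nested delimiters.

    `pairs` maps each opening character to its closing character.
    Returns a list of (line number, column, line, message) tuples.
    """
    closers = set(pairs.values())
    open_stack = []  # (opening char, line number, column, line)
    problems = []
    for line_no, line in enumerate(lines, start=1):
        for column, char in enumerate(line):
            if open_stack and char == pairs[open_stack[-1][0]]:
                open_stack.pop()  # closes the innermost open delimiter
            elif char in pairs:
                open_stack.append((char, line_no, column, line))
            elif char in closers:
                problems.append((line_no, column, line, f"unexpected {char!r}"))
    for opener, line_no, column, line in open_stack:
        problems.append((line_no, column, line, f"unclosed {opener!r}"))
    return problems

text = ["f(a[1])", "g(b]"]
for line_no, column, line, message in check_delimiters(text, {"(": ")", "[": "]"}):
    print(f"line {line_no}: {message}")
    print("  " + line)
    print("  " + " " * column + "^")  # points at the offending delimiter
```

Testing for a closing character before testing for an opening one lets a symmetric delimiter, such as a double quote that both opens and closes, pair with itself.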
1.2 Overview of program structure
WRDCHG, SYLCOR, SPLCOR, and HYPHEN share the same basic
program structure, as proposed in Weber and Kasper "Getting at the
Words in Text," Notes on Linguistics 2:17-22 (1983). The module
which performs the particular action on a word is lodged between a
module TXTIN, which separates the word from other characteristics of
the text (capitalization, punctuation, formatting), and a module
TXTOUT, which recomposes the text with the possibly modified word in
place of the original word. See the following diagram:
                      +--------+
       words -------- | ACTION | ---- (modified) words
         |            +--------+            |
         |                                  |
    +---------+   punct,capit,format   +---------+
    |  TXTIN  | ---------------------- | TXTOUT  |
    +---------+                        +---------+
         |                                  |
    input text                         output text
(SYLCHK uses the TXTIN module, but since it does not produce an
output text, it does not use TXTOUT.) Because these programs share
this structure, they share a lot of code, facilitating both
development and maintenance. I suspect that other, future programs
could benefit from this architecture, and perhaps even reuse the
TXTIN and TXTOUT modules.
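The structure can be sketched in a few lines. The sketch below is not the C code of TXTIN or TXTOUT; it compresses both into one regular-expression substitution, and the name process_text and its parameters are assumptions made for illustration. The action sees only a lower-case word; punctuation and formatting pass through untouched, and capitalization is restored afterwards.

```python
import re
import string

def process_text(text, action, extra_alphabetics=""):
    """Apply `action` to each word, preserving everything that is not a word."""
    letters = string.ascii_letters + extra_alphabetics
    word_pattern = re.compile("[" + re.escape(letters) + "]+")

    def handle(match):
        word = match.group(0)
        # TXTIN side: note the capitalization, hand over a bare lower-case word.
        all_upper = len(word) > 1 and word.isupper()
        first_upper = word[0].isupper()
        changed = action(word.lower())
        # TXTOUT side: put the capitalization back on the (modified) word.
        if all_upper:
            return changed.upper()
        if first_upper:
            return changed[:1].upper() + changed[1:]
        return changed

    return word_pattern.sub(handle, text)

expand = lambda word: {"didn't": "did not"}.get(word, word)
print(process_text("He said: DIDN'T, didn't.", expand, extra_alphabetics="'"))
# prints: He said: DID NOT, did not.
```

Without the apostrophe in extra_alphabetics, "didn't" would reach the action as the two words "didn" and "t", and no change would be made.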
1.3 Some lessons from history
A bit of history is in order, particularly since it is
instructive as to how programs such as these can arise in response
to needs felt by field linguists.
My involvement in the development of these programs (exclusive
of HYPHEN) has been to see the need for a program, to arrive at an
approximate conceptualization of it, to write out an
elementary design, to interact with the implementors (answering
questions about how I think it should work, providing test data, and
so on), and to help write the documentation.
The programming expertise was virtually all contributed by
volunteers. The first volunteer was Bob Kasper. Bob came to Peru
upon finishing his B.S. at Cornell University to implement the
Computer Assisted Dialect Adaptation program. As part of this he
wrote the TXTIN and TXTOUT functions. The CADA program required a
change module, so after that was developed, I suggested that Bob
make the WRDCHG program by putting that module between TXTIN and
TXTOUT. Since all the pieces were there, it was not a major job,
and the first version of WRDCHG was born. About the same time, I
began learning the C programming language, and wrote the first
version of SYLCHK and DELIM with Bob's help.
During Bob's stay in Peru, Alex Waibel (who worked in speech
research at Carnegie-Mellon University) came to Peru for a two week
"working" vacation. Bob and I had a design document ready for Alex,
and about a week and a half after arriving, Alex had a working
editor, called CADAED, for application to CADA output text.
About two years later, Fred and Kathy Kuhl came to Peru for a
six week period. Fred had just finished his doctorate in Computer
Science and Kathy had taken several courses in programming. I had
written a design of SYLCOR based on my experience with a spelling
corrector on another system, and on Bob's TXTIN and TXTOUT, my
SYLCHK, and Alex's CADAED. I also had some ideas for how WRDCHG,
SYLCHK and DELIM could be improved. Fred and Kathy went right to
work, Fred on WRDCHG and SYLCOR, and Kathy on SYLCHK and DELIM.
When Fred and Kathy left six weeks later, the programs were as they
now are.
SYLCOR incorporates work which Bob, Alex, Kathy and I did,
combined masterfully by Fred. Thus, for me, SYLCOR is a monument to
cooperation, volunteerism, and professionalism. Bob, Alex, Kathy,
and Fred contributed their skills, writing code which others could
build upon or building on the work of the former. My role was
simply to orchestrate this development.
My experience with these programs has confirmed something I
first learned by working with Bill Mann: that "linguistic" software
is probably best developed as a collaboration between the linguist
and the computer professionals. The linguist must identify the
problem(s) for which software is needed, conceptualize a program
(which must be computationally tractable), and then communicate this
to the computer professional, whose responsibility is to refine the
linguist's conceptualization and produce the code. And, computer
professionals who are willing to go to the field (to where the
linguist confronts the situation for which he feels the need for a
program) can make a large contribution, even if they only stay a
short while.
The development of the HYPHEN program suggests another lesson.
HYPHEN was written by Andy Black in response to an obvious need to
introduce discretionary hyphens for the text formatting demands in
the SIL computer center he manages. Andy could have started from
scratch and written the program entirely himself. But, being
familiar with the architecture and code used for WRDCHG, he used
TXTIN and TXTOUT. This accelerated his development effort, and will
save program maintenance time in the future.
Andy's example makes me optimistic about the development of
other programs -- as yet unanticipated -- which can be built without
exorbitant effort from program parts which are already in hand. If
we can make our software development cooperative in this way, each
building as much as possible on the work of others rather than
starting from scratch for every program, and if, as discussed
earlier, we can bring together the linguist and the computer
professional, then perhaps we might be able to fulfill -- to a large
measure -- our need for linguistic software.
There are other people whose names do not appear as authors but
who have contributed considerable effort in bringing this
publication to reality. Steve McConnel ported the programs to the
other operating systems and in doing so cleaned up several
inconsistencies within and between the programs. Gary Simons
provided general editorial advice and offered suggestions to make
the programs more general so they could be used in language families
quite different from the one they were originally designed to work
for. Linda Simons tested the ported versions along with Steve and
took the documentation through several updates to keep it in line
with the program improvements.
2. WORD CHANGE (WRDCHG)
2.1 Introduction
Word Change (WRDCHG) passes over a text, changing words as
specified by the user in a change table. WRDCHG can only change
words; it cannot change punctuation, format marking or
capitalization. (Each output word will have the capitalization of
the corresponding input word.) It is possible to condition changes
as applying only at word boundaries. The speed of application is
not substantially affected by the number of changes in the change
table; a large number (perhaps as many as 1500) can be made quickly.
WRDCHG can also restrict the changes to specified standard format
fields. This makes it possible, for example, to change only the
vernacular entries of a dictionary.
2.2 Making a change table
A change table is a list of paired strings, each string bounded
by double quotes ("). The first string of a pair is called the
"match string"; it specifies some pattern to be matched in a text.
The second string, called the "substitution string," specifies what
is to be substituted for each occurrence of the matched string.
Observe the following in writing a change table:
1. The changes in a table may occur in any order (i.e., the
order in which changes occur in a table makes no difference
in the effect upon any text). Therefore changes cannot be
"ordered." That is, a second change dependent upon a
condition created by a first change will not work. For
example, if the following two changes are in a table, only
the first will occur since the program will not scan the
input text a second time to find "bi?u".
"'" "?"
"bi?u" "bi?o"
2. All changes should be given in lower case. It is not
necessary to give a change with various capitalizations, as
the result of any change will be capitalized just as the
original word. For example, the change
"yeild" "yield"
will change "yeild" to "yield", "Yeild" to "Yield" and
"YEILD" to "YIELD". (WRDCHG recognizes only three
possibilities: all lower case, all upper case, and first
character capitalized.)
3. If a character (other than space or tab) appears on a line
before the first double quote mark, then that line is
regarded as a comment, and any change on that line is not
applied. This provides a simple mechanism for disabling a
change: simply put some character ahead of the first string.
For example, the following line would not make any change:
off "this" "that"
4. Any character(s) may be placed between the left and right
strings. This allows whatever notation you like to
symbolize the change. The following lines have the same
effect:
"mispelled" becomes "misspelled"
"mispelled" --> "misspelled"
"mispelled" > "misspelled"
"mispelled" "misspelled"
5. Anything following the right string is ignored, so comments
may follow the pair of strings; for example, the following
three changes are effective:
"kachaka" "alliya" `get well'
"qo" "qara" `give'
"fiyupa" "aliska" `very much'
6. Changes may be specified as applying (a) only at the
beginning of a word, (b) only at the end of a word, or
(c) only if the complete word is matched. To specify that a
change applies only at the beginning of a word, include a
space between the leading double quote and the first
character of the match string; for example, the following
change affects only the first "ka" in "kaykan":
" ka" "ke"
To specify that a change applies only at the end of a word,
include a space between the final character of the match
string and the following double quote; for example, the
following change affects only the last "na" of
"nakananpaqna":
"na " "nya"
To specify that a change applies only when the complete word
is matched, include spaces both at the beginning and end of
the match string; for example, the following changes the
word "na" when it stands alone, but would not make any
change to "nakananpaqna":
" na " "nya"
7. A change table may have multiple changes whose match string
has the same character string but which differ in terms of
boundary conditions. The order of priority for application
of changes whose match strings are the same except for
boundary conditions is 3 > 2 > 1 > 0 where
(0) anywhere within a word
(1) only at the end of a word
(2) only at the beginning of a word
(3) only when the entire word is matched
That is, 3 applies in preference to 0-2, 2 applies in
preference to 1 and 0, and 1 applies in preference to 0. (A
way to think of this is that the change with the most
restricted conditions is applied in preference to a change
with a less restricted condition.) For example, consider
Change Table I:
TABLE I
"na" "naa" (0) anywhere
"na " "nac" (1) only at the end
" na" "nab" (2) only at the beginning
" na " "nad" (3) if complete word
Change Table I changes "Nakamaananpaqna" to
"Nabkamaanaanpaqnac". The first instance of "na" is changed
to "nab" because the change with the "word-initial"
condition (2) applies in preference to the change with the
"anywhere" condition (0). Likewise, the last instance of
"na" becomes "nac" by the change with the "word-final"
condition (1) because it applies in preference to the change
with the "anywhere" condition (0). The second instance
of "na" is changed by the "anywhere" change because that is
the only change whose conditions are met. Change Table I
changes the isolated word "na" to "nad". In this case, all
of the changes are, in principle, applicable, but the one
with the "complete word" condition applies in preference to
the others (0-2).
Further, consider Change Table II:
TABLE II
"na " "nac" (1) only at the end
" na" "nab" (2) only at the beginning
In this case, when both changes are applicable (as with the
isolated word "na"), the change which applies only at the
beginning of a word (2) applies in preference to the change
which applies at the end of the word (1), so "na" becomes
"nab".
If the same match string (including boundary conditions)
occurs in more than one change in a table, the last given
will prevail. Thus, if a table contained the following
lines, "number" would be changed to "last".
"number" "first"
"number" "last"
8. If a change table contains a change for "a" and also one
for "ab", then wherever "ab" occurs the "ab" change will be
made and the "a" change will not also be made. For
instance, in the table
"'" "?"
"'u" "'o"
all occurrences of "'u" will be changed to "'o"; they will
not be changed to "?u". All other occurrences of "'" will go
to "?". If "'u" is meant to become "?o", the second line of
the change table should read: "'u" "?o".
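The rules above can be drawn together in a short sketch. It is an illustration of the documented behavior, not WRDCHG's source: the function names are hypothetical, and one point the rules leave open is settled by assumption, namely that a longer match string is preferred first (rule 8) and the boundary priority of rule 7 decides only among changes with the same match string.

```python
import re

QUOTED = re.compile(r'"([^"]*)"')

def parse_change_table(lines):
    """Map (match string, at_start, at_end) to its substitution string."""
    table = {}
    for line in lines:
        if '"' not in line or line.split('"', 1)[0].strip():
            continue  # nothing to load, or text before the first quote: a comment
        strings = QUOTED.findall(line)  # text between and after the pair is ignored
        if len(strings) < 2:
            continue
        match, substitution = strings[0], strings[1]
        key = (match.strip(" "), match.startswith(" "), match.endswith(" "))
        if key[0]:
            table[key] = substitution  # a later duplicate replaces an earlier one
    return table

def change_word(word, table):
    """Apply the table once, left to right, keeping the word's capitalization."""
    lower = word.lower()
    pieces, i = [], 0
    while i < len(lower):
        best = None
        for (match, at_start, at_end), substitution in table.items():
            fits = (lower.startswith(match, i)
                    and (not at_start or i == 0)
                    and (not at_end or i + len(match) == len(lower)))
            if fits:
                # Assumption: longest match first, then priority 3 > 2 > 1 > 0.
                rank = (len(match), 2 * at_start + at_end)
                if best is None or rank > best[0]:
                    best = (rank, match, substitution)
        if best:
            pieces.append(best[2])  # output is never rescanned (rule 1)
            i += len(best[1])
        else:
            pieces.append(lower[i])
            i += 1
    result = "".join(pieces)
    if len(word) > 1 and word.isupper():
        return result.upper()
    if word[:1].isupper():
        return result[:1].upper() + result[1:]
    return result

table_i = parse_change_table([
    '"na" "naa" (0) anywhere',
    '"na " "nac" (1) only at the end',
    '" na" "nab" (2) only at the beginning',
    '" na " "nad" (3) if complete word',
])
print(change_word("Nakamaananpaqna", table_i))  # Nabkamaanaanpaqnac
print(change_word("na", table_i))               # nad
```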
2.3 The default mode
In many cases, virtually all the changes in a table will have
the same condition. For example, suppose that you are working in a
language which does not have prefixes, and you wish to make a number
of changes to roots. It would be possible to ensure that the
changes apply only to roots by including a space at the beginning of
each match string. However, this has been made unnecessary by
providing the appropriate "default mode" at the time of running the
program. WRDCHG gives the following prompt:
Should changes be made
(0) anywhere within a word
(1) only at the end of a word
(2) only at the beginning of a word
(3) only when the entire word is matched
Type 0, 1, 2, or 3 :
The effect of answering 0 (or RETURN) is that all changes will occur
exactly as you have specified them in the change table, including
the leading and/or following spaces you have included. The effect
of answering "1" is as though a space were included at the end
of each match string; the effect of answering "2" is as though a
space were included at the beginning of each match string; and the
effect of answering "3" is as though spaces were added at the
beginning and end of each match string. Note that the appropriate
response can make it unnecessary (though not incorrect) to include a
space in the actual change table. For example, Change Table III
applied with default mode "3" is equivalent to applying Change
Table IV:
     TABLE III              TABLE IV
   "yee"  "yey"         " yee "  "yey"
   "kyo"  "kiw"         " kyo "  "kiw"
   "pok"  "puk"         " pok "  "puk"
2.4 Making a standard format marker field file
This file gives you the ability to pick and choose which parts
of a standard format file the changes are to apply to. To do this,
merely create a file listing the markers indicating the desired
fields. If you want all fields or if the file does not contain
standard format data, then this file should be empty. The layout of
this file is very free. Thus the following are all equivalent:
(1) The following markers indicate which fields
    are to be changed:
        \w
        \i
(2) \w
    \i
(3) \w\i
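One way such a free layout could be read is to ignore everything except the backslash markers. The sketch below assumes that a marker is a backslash followed by characters up to the next space or backslash, and that a field runs from a line beginning with a marker to the next such line; both assumptions, and the function names, are illustrative rather than taken from the WRDCHG source.

```python
import re

MARKER = re.compile(r"\\[^\s\\]+")

def read_markers(file_text):
    """Collect the markers named anywhere in a marker field file."""
    return set(MARKER.findall(file_text))

def select_fields(lines, markers):
    """Yield (line, selected); an empty marker set selects every line."""
    current = None
    for line in lines:
        found = MARKER.match(line)
        if found:
            current = found.group(0)  # a new field starts on this line
        yield line, (not markers) or current in markers

layout_1 = "The following markers indicate which fields\nare to be changed:\n\\w\n\\i\n"
layout_2 = "\\w\n\\i\n"
layout_3 = "\\w\\i"
print(read_markers(layout_1) == read_markers(layout_2) == read_markers(layout_3))
```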
2.5 Running the program
When WRDCHG starts it prints the following:
WORD CHANGE Version 2.3 (12-Dec-86)
You are first informed of how much memory is available for a change
table by a message like the following:
SETUP-ALLOC 22832 bytes for records
You are then asked to indicate characters which you wish to have
treated as alphabetic characters along with the standard ones. Note
that all other characters will be regarded as occurring outside of
words. For example, if one wished to change "didn't" to "did not",
the apostrophe (') would have to be treated as an alphabetic
character; otherwise WRDCHG will treat "didn't" as two words, "didn"
and "t".
Type RETURN to include these as alphabetic characters: ~'
Otherwise type the characters desired:
After you respond, WRDCHG will inform you of the characters it will
treat as alphabetic. For example, if you responded by typing a
tilde (~), you will then see the following:
Using the following as alphabetics:
~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Before you are asked for the name of the change table, WRDCHG
needs to know two things, the "trie level" and the "default mode."
We will now discuss each of these in turn. The change table is
stored in the computer's memory as a type of tree structure, called
a "trie." Tries are more efficient than simple lists in two ways:
(a) it is possible to find entries much more quickly, and (b) for
large tables, more changes can be stored. The degree to which this
efficiency is attempted is set by the number you give in response to
the prompt:
Maximum number of levels in the trie: [99]
If there were nothing to pay for the efficiency, one would simply
strive for the maximum, responding always with a carriage return.
But that is not the case. If the change table is not large enough to
take advantage of the density you hope to achieve, more space is
used than necessary. (It's something like packaging soap in economy
sized boxes: if you don't fill them the result takes up more room
than necessary.)
As a rule of thumb, use 2 or 3 for tables with up to 1000
entries. You will probably develop a feel for what number is
appropriate; you might even experiment, loading the same table with
different numbers and seeing which number leaves the most space (as
reported by messages concerning free space given before and after
the change table is loaded). By the way, if you set the number too
low (say 0 or 1 for over 500 entries) the time it takes to find each
change will increase considerably.
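For readers unfamiliar with tries, here is a minimal sketch of one with a limit on the number of levels. It is not WRDCHG's data structure: in particular, the assumption that match strings longer than the limit keep their remaining characters in a plain list at the deepest node is only one plausible reading of "maximum number of levels", and all names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.entries = []   # (rest of match string, substitution)

def insert(root, match, substitution, max_levels=99):
    """Store a change, branching on at most max_levels leading characters."""
    node, depth = root, 0
    while depth < min(len(match), max_levels):
        node = node.children.setdefault(match[depth], TrieNode())
        depth += 1
    node.entries.append((match[depth:], substitution))

def longest_match(root, word, start=0):
    """Return (length, substitution) for the longest match at `start`, or None."""
    best = None
    node, depth = root, 0
    while node is not None:
        for rest, substitution in node.entries:  # linear scan of the leftovers
            length = depth + len(rest)
            if word.startswith(rest, start + depth) and (best is None or length > best[0]):
                best = (length, substitution)
        next_char = word[start + depth:start + depth + 1]
        node = node.children.get(next_char) if next_char else None
        depth += 1
    return best

for levels in (0, 1, 99):
    root = TrieNode()
    for match, substitution in [("a", "x"), ("ab", "y"), ("abc", "z"), ("b", "w")]:
        insert(root, match, substitution, max_levels=levels)
    print(levels, longest_match(root, "abd"))  # (2, 'y') at every setting
```

With a limit of 0 every change sits in one list at the root and each lookup scans the whole table, which is consistent with the warning above about setting the number too low.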
Next you will be asked about the "default mode" by the
following prompt:
Changes should be made:
0) anywhere within a word
1) only at the end of a word
2) only at the beginning of a word
3) only when the entire word is matched
Type 0, 1, 2, or 3 : [0]
This has been discussed above in section 2.3.
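Following the numbering shown in the prompt (1 for the end of a word, 2 for the beginning, 3 for the entire word), the effect of the default mode on a match string can be sketched as below. The function name is hypothetical, and the sketch assumes that the default mode is simply added to whatever boundary spaces the match string already carries.

```python
def with_default_mode(match, mode):
    """Return the match string as it would read with the default mode applied."""
    at_start = match.startswith(" ") or mode in (2, 3)
    at_end = match.endswith(" ") or mode in (1, 3)
    core = match.strip(" ")
    return (" " if at_start else "") + core + (" " if at_end else "")

table_iii = ["yee", "kyo", "pok"]
print([with_default_mode(match, 3) for match in table_iii])
# prints: [' yee ', ' kyo ', ' pok ']
```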
Now that you have provided the "trie level" and the "default
mode," WRDCHG is prepared to load a change table. It asks for it
with the following prompt:
Change table file:
When it is finished loading, it informs you of the number of changes
loaded and the amount of storage left. For instance,
235 changes loaded.
24733 bytes left, largest space is 14733 bytes.
Now you are asked for the name of the file that indicates which
specific standard format fields the changes apply to. This is done
by the following prompt:
Standard format marker field file: (<RETURN> for all fields)
See section 2.4 for a discussion of this file. If you want all
fields, then merely press the <RETURN> key. You are next asked for
the name of the file to be changed:
Input file:
You are also asked to give a name for the output file (i.e., the
changed file). WRDCHG makes up a default file name which you can
use by simply responding with a carriage return. For example, if
your input file name is abcdef.sfm, then the prompt for an output
file will appear as:
Output file [abcdef.chg]:
and by simply typing a carriage return you can create the output
file on the default device with the name abcdef.chg. After the file
is processed, you will be informed at the terminal of the number of
words which were read and the number which were altered with a
message like the following:
INPUT: 234 words
234 words read, 7 altered.
WRDCHG allows multiple input files (all to be processed with
the same change file, the same trie level, and the same default
mode). You are asked:
Next input file: (<RETURN> if no more)
If you respond with a file name, you will be asked for an output
file name as before, and that file will be processed. If you
respond with a carriage return, you terminate WRDCHG and return to
the monitor.
3. SYLLABLE-BASED SPELLING CHECKING (SYLCHK)
3.1 Introduction
SYLCHK identifies possible typographical errors and
misspellings in texts by judging the phonological well-formedness of
each word: a word is a possible error if it cannot be decomposed
into one or more well-formed syllables. SYLCHK assumes that a
syllable is made up of an optional onset, a vocalic nucleus, and an
optional coda; the user must supply a table of these for the
language to which he is applying SYLCHK. (Obviously SYLCHK cannot
be applied in a language whose writing system does not approximate
phonological form.)
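The well-formedness test can be sketched as a search for some division of the word into syllables, each consisting of an optional onset, a nucleus, and an optional coda. The code below is an illustration only: it does not read an ONC file or model the distribution classes of section 3.4, and the small inventory is a made-up example, not the table for any real language.

```python
from functools import lru_cache

def is_well_formed(word, onsets, nuclei, codas):
    """True if the word divides completely into (onset) nucleus (coda) syllables."""
    optional_onsets = set(onsets) | {""}
    optional_codas = set(codas) | {""}
    real_nuclei = {n for n in nuclei if n}  # a nucleus may not be empty

    @lru_cache(maxsize=None)
    def decomposes(start):
        if start == len(word):
            return True
        for onset in optional_onsets:
            if not word.startswith(onset, start):
                continue
            after_onset = start + len(onset)
            for nucleus in real_nuclei:
                if not word.startswith(nucleus, after_onset):
                    continue
                after_nucleus = after_onset + len(nucleus)
                for coda in optional_codas:
                    if (word.startswith(coda, after_nucleus)
                            and decomposes(after_nucleus + len(coda))):
                        return True
        return False

    return bool(word) and decomposes(0)

# Hypothetical inventory, for illustration only.
ONSETS = {"ch", "h", "k", "l", "m", "n", "p", "q", "r", "s", "t", "w", "y"}
NUCLEI = {"a", "i", "u"}
CODAS = {"k", "n", "q", "r", "s", "y", ":"}

for word in ["nakananpaqna", "hanunn", "akrarkran"]:
    verdict = "ok" if is_well_formed(word, ONSETS, NUCLEI, CODAS) else "possible error"
    print(word, verdict)
```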
SYLCHK never alters the text to which it is applied. However,
it may be used to correct text files in the following way:
1. SYLCHK is applied to one or more text files, accumulating
the possible errors in a single output file.
2. This error file is sorted and edited to create a change
table for correcting the errors.
3. The change table is applied to the text files with a program
like WRDCHG (in this package) or CC (Consistent Changes).
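Step 2 might be supported by a small helper such as the following, which sorts the possible errors, removes duplicates, and writes a skeleton change table for hand editing. The helper and its output convention (each word paired with itself under the "complete word" condition, so that an unedited line changes nothing) are suggestions of this sketch, not part of the package.

```python
def change_table_skeleton(possible_errors):
    """Return change-table lines, one per distinct word, ready for hand editing."""
    words = sorted({word.strip().lower() for word in possible_errors if word.strip()})
    # The spaces inside the match string restrict each change to complete words.
    return [f'" {word} " "{word}"' for word in words]

for line in change_table_skeleton(["wais", "hanunn", "wais", "akrarkran"]):
    print(line)
```

The user would then edit the right-hand string of each line to the correct spelling and delete the lines for words that turn out to be correct.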
3.2 Running the program
When SYLCHK is run the following will appear on the screen:
SYLLABLE BASED SPELLING CHECK Version 3.0 (15-Dec-86)
You are then informed of how much memory is available by a message
like the following:
SETUP-ALLOC-10904 bytes for records
You are then asked to indicate characters which you wish to
have treated as alphabetic characters along with the standard ones.
Note that all other characters will be regarded as occurring outside
of words. For example, if one wished to check "didn't" as a single
word, the apostrophe (') would have to be treated as an alphabetic
character; otherwise SYLCHK will treat "didn't" as two words, "didn"
and "t".
Press <RETURN> to include these as alphabetic characters: ~'
Otherwise type the characters desired:
After you respond, SYLCHK will inform you of the characters it will
treat as alphabetic. For example, if you responded by typing a
tilde (~), you will then see the following:
Using the following as alphabetics:
~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
The next thing SYLCHK does is ask for two things relating to the ONC
or Onset-Nucleus-Coda file. Details about the form of this file are
given below in section 3.4. The ONC file tells the program which
characters (and character sequences) are allowed to form correct
syllables. First it asks for the character you have used to
separate ONC distribution classes in the ONC file, and then asks for
the name of this file:
Character which separates ONC distribution classes: [\]
ONC file:
You will then be asked for an orthography change table. If you do
not want to use a change table, simply press <RETURN>. An
orthography change table allows you to normalize the spelling of
words before they are checked, which may be very useful. For
example, in the practical orthography for Quechua, long vowels are
represented as two vowels, (e.g. long /a/ is represented as "aa").
However, in the phonological system, long vowels pattern as a vowel
followed by a consonant, so long /a/ patterns as an /a/ followed by
a consonant [length]. (For a justification of this analysis, see
David Weber and Peter Landerman "The Interpretation of Long Vowels
in Quechua" IJAL, January 1985, pages 94-108.) In order that SYLCHK
treat long vowels in this way, the words are normalized by changing
"aa" to "a:"; "ee" to "e:"; and so on, and ":" is listed as a coda
in the ONC file. The format of an orthography change table is
described below in section 3.5. If an orthography change table is
specified, the program responds with a count of the changes loaded:
Orthography change table file: [None]
5 changes loaded.
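The normalization of long vowels described above can be sketched as a single left-to-right pass over each word. The representation of the changes as a Python list of pairs is an assumption of this example; the actual file format is the subject of section 3.5.

```python
LONG_VOWEL_CHANGES = [("aa", "a:"), ("ee", "e:"), ("ii", "i:"), ("oo", "o:"), ("uu", "u:")]

def normalize(word, changes):
    """Rewrite the word once, left to right, without rescanning the output."""
    pieces, i = [], 0
    while i < len(word):
        for old, new in changes:
            if old and word.startswith(old, i):
                pieces.append(new)
                i += len(old)
                break
        else:
            pieces.append(word[i])  # no change applies at this position
            i += 1
    return "".join(pieces)

print(normalize("kamaa", LONG_VOWEL_CHANGES))  # kama:
```

With ":" listed as a coda, the normalized form is then checked in place of the original spelling.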
Now you are asked for the name of the file that indicates which
specific standard format fields the program will check. This is
done with the following prompt:
Standard format marker field file: (<RETURN> for all fields)
See section 2.4 for a discussion of this file. If you want all
fields, simply press the <RETURN> key. Next you will be asked for
an output file:
Output file: [con]
If you simply type a carriage return, the list of possibly
misspelled words will be displayed on the terminal. If you type the
appropriate device name to refer to your printer it will be printed
(without creating a file). If you type a file name, the result will
be written to that file. Next, you are asked for the file to be
checked with the prompt
Input file:
The program will then begin processing the text. Each time
SYLCHK successfully decomposes a word into well-formed syllables, a
period will appear on the screen, enabling you to watch its rate of
progress. At the end you will see a summary like:
INPUT: 386 words.
73 possible errors in abcdef.ghi
SYLCHK allows multiple input files to be checked (with the same ONC
specifications, etc.). You are asked:
Next input file (RETURN if no more):
If you respond with a <RETURN> the program will terminate and you
will return to the monitor.
3.3 The form of the output
The output file will contain, for each file being checked, its
name, the potential errors found in that file (with each possibly
misspelled word on a separate line), and following the last possible
error, the number of possible errors found in that file.
Possible errors in HGMK01.SFM
akrarkran
hanunn
wais
3 possible errors
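If you want to produce or post-process a listing in this layout yourself, the following Python sketch shows the shape of it. The function name is hypothetical, and the wording of the lines is copied from the sample above rather than from the program source.

```python
def format_report(filename, suspects):
    """Lay out one file's section of the listing as shown in the sample."""
    lines = ["Possible errors in " + filename]
    lines.extend(suspects)  # one possibly misspelled word per line
    lines.append("%d possible errors" % len(suspects))
    return "\n".join(lines)

report = format_report("HGMK01.SFM", ["akrarkran", "hanunn", "wais"])
```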
3.4 How to write the ONC file
This file informs SYLCHK of the characters and character
strings that are acceptable syllable onsets, nuclei and codas.
These appear in five sets, corresponding to the following
distribution classes:
first = only in syllable onset (e.g., kw, sy, n~)
second = only in syllable coda (e.g., length)
third = in either the onset or coda; if ambiguous,
will be interpreted as onset (e.g., k, ch)
fourth = in either the coda or onset; if ambiguous,
will be interpreted as coda
fifth = in the vocalic nucleus (e.g., a, e, i, o, u)
Members of each set are mutually exclusive of all other sets, that
is, no phoneme can occur in more than one distribution class. The
third and fourth classes are listed as they are to solve the problem
of ambiguity: how does one divide words that are of the CVCVC
pattern? In the third set, onset or coda, phonemes are listed that
can occur as either onsets or codae. If a member of this set occurs
as the middle C in a CVCVC pattern, the program will interpret it
as an onset, that is, CV.CVC. Likewise, phonemes listed in the
fourth set, coda or onset, will be interpreted as a coda if they
occur as the middle C in a CVCVC pattern, that is CVC.VC.
The beginning and ending of each class is marked by a "\"
(backslash). (Thus, the file should contain 10 \'s.) Any
characters outside of these five regions are treated as comment
(i.e., everything before the first "\", between the second and
third, the fourth and fifth, the sixth and seventh, the eighth and
ninth, or following after the last "\" is comment.) Within each
class, characters and character strings should be separated by one or
more whitespace characters (tab, blank or carriage return).
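The delimiter rule above can be checked mechanically. The Python sketch below is an illustration only (the function and class names are made up here, and it is not how SYLCHK itself reads the file): it splits the file text on the separator, keeps the odd-numbered pieces as the five classes, and treats everything else as comment.

```python
CLASS_NAMES = ("onset_only", "coda_only", "onset_or_coda",
               "coda_or_onset", "nucleus")

def parse_onc_classes(text, separator="\\"):
    """Return (classes, trailer) from the text of an ONC file.

    classes maps each illustrative class name to its list of units;
    trailer is whatever follows the tenth separator.
    """
    parts = text.split(separator)
    if len(parts) != 11:  # ten separators give eleven pieces
        raise ValueError("expected 10 separators, found %d" % (len(parts) - 1))
    # Odd-numbered pieces lie inside a class; the others are comment.
    classes = {name: parts[2 * i + 1].split()
               for i, name in enumerate(CLASS_NAMES)}
    return classes, parts[10].strip()

SAMPLE = r"""ONC.TOB by Linda Simons December 1986
ONSET ONLY \ b d f g gw k kw ng ' l m n r s t th w \
CODA ONLY \ \
ONSET OR CODA \ \ considered onset if ambiguous
CODA OR ONSET \ \ considered coda if ambiguous
NUCLEUS \ a e i o u \
SYLLABLE PATTERN ([O]N) ([O]N) ([O]N)
"""

classes, trailer = parse_onc_classes(SAMPLE)
```

A file with a missing or extra separator raises an error here, which is a quick way to catch the most common mistake in a hand-written ONC file.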
The ONC file also tells SYLCHK what are acceptable syllable
patterns within words. Three patterns are given. The first
describes only initial syllables, the third describes only final
syllables, and the second describes all medial syllables. The
parentheses are used to indicate a syllable, the square brackets
indicate an optional phoneme. Be certain there is no matching
parenthesis or square bracket missing.
Here is a sample ONC file (used for Quechua):
NJSYL.ONC modified for SYLCHK v. 3. by Steve McConnel,
12-Dec-86
ONSET ONLY \ dy br pr by b d dr f fw fy gy h hy kl j
hw ky kw py pw rr sy ty n~ kr bl n~w \
CODA ONLY \ : \
ONSET OR CODA \ ch g k l ll m n p q r s sh t tr ts w y \
CODA OR ONSET \ \
NUCLEUS \ a e i o u a' e' i' o' u' \
SYLLABLE PATTERNS ([O]N[C]) (ON[C]) (ON[C])
Here is a second example of an ONC file describing the
syllable pattern for To'abaita (Solomon Islands) where the only two
syllable shapes are V and CV.
ONC.TOB by Linda Simons December 1986
ONSET ONLY \ b d f g gw k kw ng ' l m n r s t th w \
CODA ONLY \ \
ONSET OR CODA \ \ considered onset if ambiguous
CODA OR ONSET \ \ considered coda if ambiguous
NUCLEUS \ a e i o u \
SYLLABLE PATTERN ([O]N) ([O]N) ([O]N)
You should not be unduly concerned about making this table complete.
Create a first approximation with those characters that come to
mind, and try it out on a text. It will then quickly become obvious
which characters and character strings you need to add to the table.
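To see how such a table drives the check, here is a minimal Python sketch of syllable decomposition. It is not the SYLCHK algorithm, only one way to test the same property under stated assumptions: the function name is hypothetical, a word passes if any division into syllables fits the patterns (so the onset-versus-coda preference for ambiguous consonants is not modeled), and the first syllable is checked against the initial pattern even when it is also the last.

```python
from functools import lru_cache

def is_well_formed(word, onsets, nuclei, codas,
                   patterns=("[O]N[C]", "ON[C]", "ON[C]")):
    """True if word splits into syllables that fit the three patterns."""
    def slot(pattern, letter):
        if "[" + letter + "]" in pattern:
            return "optional"
        return "required" if letter in pattern else "absent"

    def ends(pos, units, mode):
        # Positions reachable after reading (or skipping) one unit at pos.
        if mode != "required":
            yield pos
        if mode != "absent":
            for unit in units:
                if unit and word.startswith(unit, pos):
                    yield pos + len(unit)

    @lru_cache(maxsize=None)
    def rest_ok(pos):
        if pos == len(word):
            return True
        kinds = [0] if pos == 0 else [1, 2]  # initial, or medial/final
        for kind in kinds:
            pat = patterns[kind]
            for a in ends(pos, onsets, slot(pat, "O")):
                for b in ends(a, nuclei, "required"):
                    for c in ends(b, codas, slot(pat, "C")):
                        if kind == 2 and c != len(word):
                            continue  # a final syllable must end the word
                        if kind == 1 and c == len(word):
                            continue  # a medial syllable must not
                        if rest_ok(c):
                            return True
        return False

    return len(word) > 0 and rest_ok(0)

# Units taken from the Quechua sample file above.
ONSET_ONLY = ("dy br pr by b d dr f fw fy gy h hy kl j hw ky kw py pw "
              "rr sy ty n~ kr bl n~w").split()
EITHER = "ch g k l ll m n p q r s sh t tr ts w y".split()
NUCLEI = "a e i o u a' e' i' o' u'".split()
ONSETS = ONSET_ONLY + EITHER
CODAS = [":"] + EITHER
```

Under these assumptions `is_well_formed("wasi", ONSETS, NUCLEI, CODAS)` holds, while "hanunn" and "wais" fail because the final "n" and the "i" have no way to belong to a syllable.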
3.5 How to write an orthography change table
An orthography change table is a list of paired strings, each
string bounded by double quotes ("). The first string of a pair
specifies some pattern to be matched in a text, and the second
string specifies what is to be substituted for each occurrence of
the matched string. For example, the following is the table used
for Quechua mentioned above:
LNGVWL.TAB D. Weber May-30-82
"aa" "a:"
"ee" "e:"
"ii" "i:"
"oo" "o:"
"uu" "u:"
Observe the following in writing a change table:
1. The changes may occur in any order, that is, their order
makes no difference in the effect on the text.
2. All changes should be given in lower case; it is not
necessary to give a change with various capitalizations.
3. Any line whose first printing character is not a double
quote is treated as a comment. (Note that a space or tab could
precede an effective change, since these are not printing
characters.)
4. Any characters may be placed between the left and right
strings. This allows whatever notation you like to
symbolize the change; the following lines have the same
effect:
"mispelled" becomes "misspelled"
"mispelled" --> "misspelled"
"mispelled" > "misspelled"
"mispelled" "misspelled"
5. Anything following the right string is ignored, so comments
may follow the pair of strings; for example, the following
three changes are effective:
"kachaka" "alliya" `get well'
"qo" "qara" `give'
"fiyupa" "aliska" `very much'
4. SYLLABLE-BASED SPELLING CORRECTION (SYLCOR)
4.1 Introduction
SYLCOR is a program for correcting misspellings and
typographical errors in text. SYLCOR identifies possible errors by
judging the phonological well-formedness of each word: a word
possibly has an error if it cannot be decomposed into one or more
well-formed syllables. SYLCOR assumes that a syllable is made up of
an optional onset, a vocalic nucleus, and an optional coda; the user
must supply a table of these for the language to which he is
applying SYLCOR. (SYLCOR cannot be applied in a language whose
writing system does not approximate phonological form, for example,
Chinese.)
Potential errors in text may be exceptions to whatever method
is used to discover them. For example, if error detection for the
Quechua language is based on phonological well-formedness, then many
words borrowed from Spanish are exceptions. SYLCOR uses lists
(which you create as you correct text) to skip such exceptional
words. You might have a list of loan words, a list of
abbreviations, a list of Biblical names, or something else.
Potential errors in text may be real errors. SYLCOR allows
these to be corrected. Context is sometimes needed to determine
what the correct word should be. For example, if you were to
encounter the misspelling "ther" out of context, you would not know
whether it should be corrected to "their", "there", "other", "the",
etc. Therefore, each time an error is suspected, SYLCOR displays a
region of text surrounding the suspect word.
For many errors, you will simply want to correct the error and
continue through the text. For common errors, you may want to have
all subsequent instances corrected automatically. For example, you
might want all instances of "recieve" to become "receive"
automatically. SYLCOR allows you to create (in the process of
correcting text) lists of automatic changes. You may choose to have
each automatic correction presented for your approval before it
modifies the text.
When you begin a session with SYLCOR, the files containing
exceptions and auto-corrections are loaded. At the end of each text
corrected, for each file to which there have been additions, you are
asked if you would like to update the file or back up the additions.
In this way, the files may be enlarged by each session, and
consequently you do less and less work in subsequent sessions.
SYLCOR may be applied to many texts in one session. For each
input text file, a corresponding output file will be created.
SYLCOR deals only with the words of the text, and deals with
them only one at a time. All the format marking, punctuation and
capitalization are passed unchanged from the input text to the
output text.
As mentioned above, SYLCOR uses phonological well-formedness
for detecting potential errors. SYLCOR's error detector is
precisely that of SYLCHK. Both use the same data files, i.e. the
same orthography normalization table and the same file of acceptable
onsets, nuclei and codae. Before running SYLCOR, you might find it
helpful to run SYLCHK on some text; this will help you to develop
the data you need in the tables.
If you intend to put words into a new auto-corrections or
exceptions file during a SYLCOR session, you must create these files
before you run SYLCOR. The files may be empty, but you are
encouraged to place identifying comments in them, according to the
syntax given below (see section 4.7).
4.2 Initiating a session with SYLCOR
After giving the command to run SYLCOR, you will first see a
line showing you the amount of available memory. Then you must
respond to some questions so that some files can be loaded and so
that certain options may be set. You are first asked for a setup
file with the prompt:
Setup file: [none]
If you do not have a setup file, you must answer a series of
questions interactively at the terminal. If you provide the name of
a setup file, many of the subsequent questions will be answered from
the file, and you will be free to seek the beverage of your choice
while the files load. The following is a sample setup file:
Setup file for using SYLCOR with To'abaita texts
'
2
1
autoco.tob
y
loan.tob
biblic.tob
\
onc.tob
fields.tob
The first line will always be skipped; this allows you to provide an
identifying comment. Subsequent lines provide responses to the
questions in the order the program asks them as discussed below.
There may be from zero to four names of exceptions lists and after
the last exception file is given there must be a carriage return. If
some file cannot be found, setting up becomes interactive, and you
must provide the correct responses from the terminal (unless you
want to abort SYLCOR, edit the setup file and try again).
After being asked for a setup file, you will then be asked
which characters you want treated as alphabetic characters in
addition to the standard ones:
Press <RETURN> to include these as alphabetic characters: ~'
Otherwise type the characters desired:
All other characters will be regarded as occurring outside of words.
For example if you wish to treat "oyo't" as a word, include the
apostrophe (') as an alphabetic character; otherwise SYLCOR will
treat "oyo't" as the two words "oyo" and "t". After you respond,
SYLCOR will inform you of the characters it is treating as
alphabetic. For example, if you responded by typing a tilde (~) and
an apostrophe ('), you will then see the following:
Using the following as alphabetics:
~'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
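The effect of this setting on word division can be illustrated with a short Python sketch. The function name is hypothetical and this is not the SYLCOR source; it simply treats any run of alphabetic characters as a word and everything else as lying outside of words.

```python
import string

def split_words(text, extra_alphabetic=""):
    """Split text into words, given extra characters to count as letters."""
    alphabet = set(string.ascii_letters) | set(extra_alphabetic)
    words, current = [], []
    for ch in text:
        if ch in alphabet:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

without_apostrophe = split_words("oyo't")
with_apostrophe = split_words("oyo't", "~'")
```

Without the apostrophe in the alphabet, "oyo't" comes out as the two words "oyo" and "t"; with it, the word stays whole.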
The auto-corrections and exceptions you are about to be asked
for are stored in the computer's memory as a type of tree structure,
called a "trie." Tries are more efficient than simple lists in two
ways: (i) it is possible to find entries much more quickly, and
(ii) for large tables, more changes can be stored. The degree to
which this efficiency is attempted is set by the number you give in
response to the prompt:
Maximum level for trie [no limit]?
If there were nothing to pay for the efficiency, one would simply
strive for the maximum, responding always with a carriage return.
But efficiency isn't free. If the dictionary is not large enough to
take advantage of the density you hope to achieve, more space is
used than necessary. (It's something like packaging soap in economy
sized boxes: if you don't fill them the result takes up more room
than necessary.)
As a rule of thumb, respond with 2 or 3 for tables with up to
1000 entries. You will probably develop a feel for what number is
appropriate; you might even experiment, loading the same table with
different numbers of levels and seeing which number leaves the most
space (as reported by messages concerning free space given before
and after the change table is loaded). If you set the number too
low (say 0 or 1 for over 500 entries) the time it takes to find each
change will increase considerably.
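The manual does not spell out how the level limit works internally. The Python sketch below shows one plausible reading, offered purely as an assumption: the first few characters of each entry (up to the maximum level) are stored as branches of the tree, and whatever remains of the entry is kept in a short list at the node where branching stops. The class and method names are hypothetical.

```python
class LevelLimitedTrie:
    """A trie that branches on at most max_level leading characters."""

    def __init__(self, max_level=None):
        self.max_level = max_level  # None means no limit
        self.root = {"children": {}, "entries": []}

    def _depth(self, key):
        if self.max_level is None:
            return len(key)
        return min(len(key), self.max_level)

    def insert(self, key, value):
        node = self.root
        depth = self._depth(key)
        for ch in key[:depth]:
            node = node["children"].setdefault(
                ch, {"children": {}, "entries": []})
        # The unbranched remainder of the key is stored in a plain list.
        node["entries"].append((key[depth:], value))

    def find(self, key):
        node = self.root
        depth = self._depth(key)
        for ch in key[:depth]:
            node = node["children"].get(ch)
            if node is None:
                return None
        for rest, value in node["entries"]:
            if rest == key[depth:]:
                return value
        return None

table = LevelLimitedTrie(max_level=2)
table.insert("ther", "their")
table.insert("recieve", "receive")
```

On this reading, a low limit means long lists to search at each node (slower lookups), while a high limit on a small table means many nearly empty nodes (wasted space), which matches the trade-off described above.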
The next question the program asks is:
Minimum length of words to check: [1]
Here you should indicate the number of characters in the language's
shortest well-formed words.
Next you are asked:
Auto-corrections file: [none]
If you do not want any auto-corrections, simply type a carriage
return. If you have an auto-corrections table to load, respond with
the appropriate file name. If you do not have an auto-corrections
file and you expect to put automatic corrections into such a file,
go no further! Use ^C to get back to the monitor and create a file
(using a text editor). (The structure of this file is described
below in section 4.7.) Run SYLCOR again, and when you get to this
point, respond with its name. The auto-corrections will then be
added to this file. After an auto-correction file is loaded, you
are told how many corrections were loaded. Next comes the question:
Query before any auto-corrections? [y]
If you respond with "n" or "N", auto-corrections are carried out
automatically, without asking you to verify them. The only evidence
you will see of a change is the incrementing of the "auto-corr"
counter on the screen. If you answer "y", "Y" or simply respond
with a carriage return, then each time an auto-correction is
discovered the surrounding text is displayed in the upper part of
the screen and your approval is sought. For example:
"ther" > "their" ? [y]
This forces you to decide case-by-case whether a change is
appropriate. You will probably want to be queried for
auto-corrections at first; if you find that you always answer
positively, then you may feel comfortable about dispensing with the
warning.
After the auto-corrections query you will again be informed
about how much free memory is available. Next you are asked for an
exceptions file:
Exceptions file 1:
If you respond to this with a carriage return, SYLCOR will assume
you do not want to use any exceptions files. As with the
auto-correction file, any exceptions files you wish to use must be
created before reaching this point. They need not have any entries,
but you must respond to this prompt with the name of a
previously-created file. An exceptions file is simply a list of
words, all lower case. Its order is not significant. It is good to
have it begin with an identifying comment line. (If this line
begins with a backslash ("\") then the exceptions file can be
periodically sorted and the comment line will stay at the top.)
After Exceptions file 1 has loaded, you will be told how many
exceptions were loaded and informed about the amount of free storage
by a message such as the following:
12963 bytes left, largest space is 6568 bytes.
If either of the numbers gets below 100, SYLCOR may have problems
adding to the exceptions lists or auto-corrections table.
If you have loaded an exceptions file, you will be asked for
another:
Exceptions file 2:
You can use up to four exceptions files during any run of SYLCOR.
This allows you to keep, for example, Biblical names in one file,
linguistic jargon in another, unassimilated loan words in another,
and so on. Some applications will not need all of the exceptions
files; for instance, correcting Scripture would not need the
linguistic jargon and correcting a linguistic paper would not need
the Biblical names. When the question appears again, press <RETURN>
if you have no more exception files.
Next you are asked:
Character which separates ONC distribution classes: [\]
Next you are asked for an "ONC" table:
ONC file?
The ONC file defines the possible syllable onsets, nuclei and codae.
You must write an ONC file for the language to which you are
applying SYLCOR; how to do so is discussed in the previous section
on SYLCHK, section 3.4.
Next you are asked:
Orthography change table file: [none]
If you do not want to use a change table, simply press <RETURN>. A
change table allows you to normalize the spelling of words before
they are checked, which may be very useful. For example, in the
practical orthography for Quechua, long vowels are represented as
two vowels, for example, long /a/ is represented as "aa". However,
in the phonological system, long vowels pattern as a vowel followed
by a consonant, so long /a/ patterns as an /a/ followed by a
consonant [length]. In order that SYLCOR treat long vowels in this
way, the words are normalized by changing "aa" to "a:"; "ee" to
"e:"; and so on, and ":" is listed as a coda in the ONC file. The
format of an orthography change table is described in detail in the
preceding section on SYLCHK, section 3.5. The next question is:
Standard format marker field file: (<RETURN> for all fields)
This file will list the specific standard format fields that
you want SYLCOR to read. See the preceding section on WRDCHG,
section 2.4, for details of how this file should look. Simply press
<RETURN> if you want SYLCOR to read all fields. Remember that all
the answers to these questions can be put in a setup file as
discussed already.
At this point you will see a message on the screen and finally,
you are asked for an input file:
Input file:
To this you must respond with the name of the file you wish to
correct. If the file is not found, you will be asked to try again.
When the file is found, you are asked for the name of the output
file. SYLCOR makes a default file name which you can use by simply
typing a carriage return; the default writes to the default device and
gives the input file name the extension .SPL. Thus, if you were
editing FUNNEY.SFM, the next prompt would be:
Output file: [FUNNEY.SPL]
Of course, you are free to respond with whatever file name you wish.
(On a two-tape system, you will definitely want to have the input
and output files on different tapes, as otherwise there will be
considerable tape spinning in the course of correcting a file.)
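The default output name shown above can be derived as in this small Python sketch. The function name is hypothetical, and device prefixes (such as DD1: or b:) are not handled.

```python
import os

def default_output_name(input_name):
    """Swap the input file's extension for .SPL."""
    stem, _extension = os.path.splitext(input_name)
    return stem + ".SPL"

suggested = default_output_name("FUNNEY.SFM")
```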
4.3 Screen layout
Suppose you initiate a session with SYLCOR in which you are
correcting a file TEXT.SFM from the default device and putting the
corrected version onto a specified device (such as DD1: or b:) under
the name TEXT.SPL (where the new extension indicates that it has
gone through a spelling corrector). The following appears slightly
above the middle of the screen in reverse video:
+-----------------------------------------------------------------+
|SYLCOR TEXT.SFM > DD1:TEXT.SPL 0 words 0 Errors 0 Auto-corr 0 Exc|
+-----------------------------------------------------------------+
The region above these lines is for the display of text. The region
below is the area in which all your interactions with SYLCOR are
displayed, that is, prompts and your responses, as well as word
editing.
As words pass through SYLCOR, the appropriate counts will be
incremented. If you finish working on TEXT.SFM and correct another
file, the new file names will be displayed and the counters will be
reset to zero. Every time a word passes from the input to the
output file, the "words" counter gets incremented. If the word is
phonologically anomalous, but is already on an exceptions list, the
"Exc" counter is incremented. If it is phonologically anomalous but
there is an auto-correction for it, the "Auto-corr" counter is
incremented. If it is phonologically anomalous and there is neither
an exception nor an auto-correction for it, then the "Error" counter
is incremented.
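The counting rules above amount to a short decision chain, sketched here in Python. The function and argument names are hypothetical; `is_anomalous` stands in for the phonological check, and the order in which exceptions and auto-corrections are consulted follows the order of the paragraph, which is an assumption about the program.

```python
def count_word(word, counters, is_anomalous, exceptions, auto_corrections):
    """Update the status-line counters for one word; return the output word."""
    counters["words"] += 1
    if not is_anomalous(word):
        return word
    if word in exceptions:
        counters["Exc"] += 1
        return word
    if word in auto_corrections:
        counters["Auto-corr"] += 1
        return auto_corrections[word]
    counters["Errors"] += 1  # the user would now be put into word edit mode
    return word

counters = {"words": 0, "Errors": 0, "Auto-corr": 0, "Exc": 0}
suspect = {"ther", "radio", "xq"}  # illustrative stand-in for the real check
output = [count_word(w, counters, lambda w: w in suspect,
                     {"radio"}, {"ther": "their"})
          for w in ["wasi", "ther", "radio", "xq"]]
```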
4.4 Handling possible errors: word edit mode
When SYLCOR suspects an error, you are put into "word edit
mode." The word is displayed in reverse video in the top part of
the screen with surrounding text. The following line appears just
below the middle of the screen:
WORD EDIT: <-,->, DEL, CTRL/U, CTRL/R, RETRN when done, ? for help
Below this appears the word you are editing, with the cursor
positioned directly after it. You may now edit this word. Any
character you type will be entered to the left of the cursor, except
for the following, which have the effect indicated:
<- or CTRL/B moves the cursor back (to the left) one character
-> or CTRL/F moves the cursor forward (to the right) one character
DELETE or BACKSPACE deletes one character to the left of the cursor
CTRL/U or CTRL/W deletes the entire word being edited
CTRL/R restores the original word, undoing all the editing
RETURN closes the editing on this word
? prints this message
If you hold down one of the arrow keys, it will move left or right
until you release the key. If you are at the end of the word and
move right, the cursor will cycle around to the beginning of the
word. If you are at the beginning of the word and move left, the
cursor will cycle around to the end of the word.
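The wrap-around behavior of the cursor is ordinary modular arithmetic. In this Python sketch (the function name is hypothetical) the cursor may sit at any of the positions 0 through the word length, where the last position is directly after the word.

```python
def move_cursor(cursor, step, word_length):
    """Move the cursor by step (+1 right, -1 left), cycling at either end."""
    return (cursor + step) % (word_length + 1)

after_end = move_cursor(5, +1, 5)      # at the end of "yeild", moving right
before_start = move_cursor(0, -1, 5)   # at the beginning, moving left
```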
When you have finished editing a word, press the carriage
return. If you have changed the word, the original word and the
corrected form are displayed as a change, and you are asked if you
want to make this change automatic (by adding this to the list of
automatic changes). For example, if you have changed "yeild" to
"yield", the following is displayed:
"yeild" > "yield" ? [n]
The "[n]" at the end of this line specifies the default value; if
you respond simply with a carriage return, the change will not be
added to the auto-corrections. If you want to add this correction,
respond with "y" or "Y". After you respond to this question, the
program again resumes searching for the next possible error.
Suppose that, instead of correcting the word, you want to leave
it just as it is. To do so, simply respond with a carriage return.
The word will then be unchanged, and you will be asked if you want
to add it to one of the exceptions lists. For example, if you have
two exceptions files, LOANS.LST and BIBNAM.LST (for loans and
Biblical names, respectively), you will see the following:
Add "xxxxx" to exceptions file?
1 - loans.lst
2 - bibnam.lst
<RETURN> to not save this exception
Type 1, 2, or <RETURN>
To this you must respond with a "1", in which case the word will be
added to LOANS.LST; a "2", in which case it is added to BIBNAM.LST;
or a carriage return, in which case it is not added to any
exceptions list, and the program resumes searching for the next
possible error. (The program will complain about any other
response.)
4.5 Making the auto-correction and exception files
When you finish correcting a text file, and the output file has
been written, you are then asked if you would like to protect the
additions made to the auto-correction and exceptions files. Only
the files to which there have been additions will be considered.
You are asked:
Update auto-corr & all exceptions files to their current names? [n]
If you respond with "y" or "Y", all files to which there have been
additions are updated under the same name and onto the device from
which they were read. Since this involves copying the original
file and then writing out the additions, this can take
considerable time on a tape based system.
If you respond negatively you are given the option to do so
file by file. You will see a prompt like the following:
For auto-corr file NJAUTO.TAB
1 - save both new and old auto-corrections
2 - save only new auto-corrections
<RETURN> to forget new auto-corrections
Type 1, 2, or <RETURN>:
This gives you the option to (1) rewrite the file with the
additions (which, again, takes a while on a tape-based system),
(2) write out a temporary backup file consisting of only the
entries you have added since your last update, or (3) do nothing
about backing up additions. The second alternative takes less
time, but in the event of a problem (e.g., a power failure) you
must later do a separate operation to append the additions to your
original file.
If you are making many additions to the auto-correct and
exceptions files, SYLCOR may ask you to protect these additions
before it gets to the end of the text file you are correcting.
This is because SYLCOR has a limited ability to keep track of all
the new additions. When it gets to the limit, it wants you to
rewrite the file with the additions (i.e., option 1, above) so
that it can start afresh remembering new additions. (Note: option
2 above will not do here, as it does not cause SYLCOR to "forget"
the old additions and start a new list.)
4.6 Ending a session with SYLCOR
SYLCOR begins the process of terminating a session when you
respond with a carriage return to the following prompt:
Next input file (<RETURN> if no more):
Since you may have done only temporary backup to this point, and
would now like to do a full backup, you are again asked
Update auto-corr & all exceptions files to their current names? [n]
When the matter of backup is settled, you are asked to replace the
system tape or disk if necessary and then type a carriage return before
control returns to the operating system:
Reinsert system disk if necessary, then press <RETURN>:
You will then be returned to the system prompt.
4.7 Writing your own auto-correction and exceptions files
It was said above that you must create the files used to hold
auto-corrections and exceptions before you run SYLCOR, but that
when you create them, you need not put in any entries. If you
know beforehand some words you wish to include in these files, you
might as well put them in with your editor. Here we discuss the
syntax of the auto-corrections and exceptions files.
An auto-correction file has the same syntax as an orthography
change table (as defined in section 3.5). Each line should
contain at most one correction. The match string comes first on
the line, followed by the substitution. Both are surrounded by
double quotes. Anything on a line outside the quotes is ignored.
Any line beginning with any printing character besides a double
quote is a comment line and is ignored. Do not use upper case
characters (except, perhaps, in comments)! It is good to start it
with an identifying comment line. It can be sorted periodically
with a line sort, and it can be used with WRDCHG.
An auto-correction file does not need to have anything in it.
Auto-corrections can be added to it by using it as the
auto-corrections file of a SYLCOR session. Thus, you can start an
auto-corrections file simply by creating (with an editor) an empty
file or a file which simply contains an identifying comment line.
Then you can add all the corrections in SYLCOR sessions.
An exceptions file contains words, one per line, with no
quote marks or blanks. Any line beginning with a non-alphabetic
character is ignored and may be used for comments. Again, do not
use upper case characters (except, perhaps, in comments)!
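As an illustration of the exceptions file syntax, the following Python sketch reads such a file into a set of words. The function name and the sample words are hypothetical. Whether characters declared as additional alphabetics (such as the apostrophe) may begin a word line is an assumption, exposed here as a parameter.

```python
def load_exceptions(text, extra_alphabetic=""):
    """Return the set of words in an exceptions file, skipping comments."""
    words = set()
    for line in text.splitlines():
        line = line.rstrip()
        # A line that begins with a non-alphabetic character is a comment.
        if line and (line[0].isalpha() or line[0] in extra_alphabetic):
            words.add(line)
    return words

SAMPLE = "\\ Loan words (illustrative)\nradio\ndoktor\n"
loans = load_exceptions(SAMPLE)
```

Starting the comment line with a backslash, as suggested in section 4.2, keeps it at the top when the file is sorted.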
5. SPELLING CORRECTION WITH TABLE LOOKUP (SPLCOR)
SPLCOR is a program for correcting potential misspellings and
typographical errors in text. SPLCOR may be applied to many texts
in one session: for each input text file, a corresponding output
file will be created.
SPLCOR deals only with the words of the text, and deals with
them only one at a time; all format marking, punctuation and
capitalization are passed unchanged from the input text to the
output text. It treats every word as a potential error unless the
word has been previously entered into an "exception" list. It is
possible to have up to four exceptions lists; for example, you
might have a list of loan words, a list of abbreviations, a list
of Biblical names, etc.
SPLCOR allows real errors to be corrected. Since context is
sometimes needed to determine what the correct word should be, a
region of text surrounding the error is displayed. For example,
if you were to encounter the misspelling "ther" out of context,
you would not know whether it should be corrected to "their",
"there", "other", "the", etc.
For many errors, you will simply want to correct the error
and continue on through the text. For common errors though, you
may want to have all subsequent instances corrected automatically.
For example, one might want all instances of "recieve" to become
"receive" automatically. SPLCOR allows you to create (in the
process of correcting text) a list of automatic changes. You may
choose to approve each automatic correction before it modifies the
text or to have it applied without your approval.
When a session with SPLCOR is initiated, the files containing
exceptions and auto-corrections are loaded. At the end of each
text corrected, you can refresh the tape or disk copies of these
files. In this way, they are enlarged by each session, so you do
less and less work in subsequent sessions.
A variant of SPLCOR, called SYLCOR, detects potential errors
on the basis of phonological well-formedness. It is expected that
in the future other spelling correctors will be available which
use the SPLCOR shell but have other error detection methods. If
you have entered (in the process of correcting text) a certain
word, it will be passed as acceptable.
For the details of running SPLCOR, see the documentation of
SYLCOR (section 4). Ignore all references to the orthography
normalization and ONC tables. All other aspects of SPLCOR are
exactly as in SYLCOR.
6. HYPHENATION (HYPHEN)
6.1 Introduction
Discretionary hyphens are symbols in a text file that
indicate places where word hyphenation at the end of a line is
allowed. Just as in English we have rules about where words can
be divided, vernacular languages do also. Having these symbols in
a text as we were working with it would be a nuisance, so the
HYPHEN program can be used to put them in just prior to printing
or typesetting. The discretionary hyphen character is read by the
formatting program Manuscripter (MS) and signals that the word
could be hyphenated there if it occurs at the end of a line when
printing takes place. This feature is especially helpful in
languages that contain many long words. If hyphenation were not
allowed, a lot of space would be wasted at the end of each line of
print.
The HYPHEN program is basically language independent. The
user defines which segments or sequences of segments constitute a
given syllabification class and then defines the hyphenation rules
in terms of these classes. The user also defines which character
sequences constitute overstrike units.
In Spanish, for example, the class of consonants contains the
segments b, l, and r and the sequences br and bl. The class of
vowels contains the segments a, á, and i and the sequences ai and
ái. One hyphenation rule in Spanish is VCV becomes V-CV. Thus
the sequence abri would be hyphenated as a-bri.
The program also allows the user to specify where in the word
hyphenation is to begin and end. Thus one can tell it to not
start hyphenating until there are at least 4 characters at the
beginning and to stop hyphenating when there are 3 characters left
at the end. This would override any hyphenation rules that might
apply near the word boundaries.
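The interaction between a hyphenation rule and the start and end limits can be sketched in Python as follows. This is an illustration under stated assumptions, not HYPHEN's own logic or rule syntax: the function name and the "V-CV" rule notation are invented here, the word is supplied already divided into units with their class letters, the limits are read as a minimum number of characters before and after each break, and an ordinary hyphen stands in for the discretionary hyphen character.

```python
def hyphenate(units, rule="V-CV", min_start=1, min_end=1, hyphen="-"):
    """Insert hyphens where the class pattern of rule occurs.

    units is a list of (text, class_letter) pairs for one word.
    """
    before, after = rule.split("-")
    pattern = before + after
    classes = "".join(letter for _text, letter in units)
    breaks = set()
    for i in range(len(classes) - len(pattern) + 1):
        if classes.startswith(pattern, i):
            breaks.add(i + len(before))  # unit index that follows the hyphen
    total = sum(len(text) for text, _letter in units)
    out, seen = [], 0
    for index, (text, _letter) in enumerate(units):
        if index in breaks and seen >= min_start and total - seen >= min_end:
            out.append(hyphen)
        out.append(text)
        seen += len(text)
    return "".join(out)

ABRI = [("a", "V"), ("br", "C"), ("i", "V")]
plain = hyphenate(ABRI)
limited = hyphenate(ABRI, min_start=4)
```

With no effective limits the Spanish example comes out as "a-bri"; requiring four characters before any break leaves "abri" unhyphenated, showing how the limits override a rule that would otherwise apply.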
HYPHEN also allows one to specify to which standard format
fields the hyphenation process is to apply. In a dictionary,
then, one can have separate classes and rules for the source
language fields (such as \w and \i) and for the target language
fields (such as \d and \t).
If HYPHEN finds a word that has any sequence that has not
been defined, it will display an error message on the screen.
This message will show what the sequence is, what the word is, and
will also state that the word will not be hyphenated.
6.2 Data files
HYPHEN uses four user-defined data files which need to be
created with a text editor before running the program.
6.2.1 Segment definition file
This file contains the information about which segments
and/or sequences belong to which classes. The information is to
be entered in a specified format.
1. All text up to the first occurrence of the word CLASS (or
class) at the beginning of a line is considered to be
comment.
2. The word CLASS (or class) at the beginning of a line
indicates that a new class is about to be defined. The
one letter abbreviation for the class should follow the
key word CLASS. Any other text after that will be
considered comment.
3. From the next line to either the end of the file or to the
next occurrence of the word CLASS at the beginning of a
line, all characters are considered to be either segments
or sequences that belong to that class.
4. Please note that no sequence can belong to more than one
class. Thus "a" cannot belong to both the class A and the
class V.
5. Also note that HYPHEN will always take the longest
possible sequence and assign its associated class to it.
As an example, let's suppose that the following classes
are defined:
CLASS V
a ai i
CLASS C
n r t tr
CLASS M
ain
Then the word "train" will be treated as a "CM" pattern
and the word "trait" would be treated as a "CVC" pattern.
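The longest-sequence behavior described in item 5 can be sketched as follows. This is an illustration in Python rather than HYPHEN's actual code; the names SEGMENTS and classify are hypothetical, and the sketch assumes a simple left-to-right scan with no backtracking.

```python
# Classes from the example above: each segment or sequence maps to one class.
SEGMENTS = {"a": "V", "ai": "V", "i": "V",
            "n": "C", "r": "C", "t": "C", "tr": "C",
            "ain": "M"}


def classify(word, segments=SEGMENTS):
    """Return the class pattern of `word`, taking the longest defined
    sequence at each position. Raises ValueError on an undefined sequence."""
    longest = max(len(s) for s in segments)
    pattern = []
    pos = 0
    while pos < len(word):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(word) - pos), 0, -1):
            cls = segments.get(word[pos:pos + size])
            if cls is not None:
                pattern.append(cls)
                pos += size
                break
        else:
            raise ValueError(f"undefined sequence at {word[pos:]!r} in {word!r}")
    return "".join(pattern)


print(classify("train"))   # CM
print(classify("trait"))   # CVC
```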
The following shows an example from Campa Pajonal (a language
of the Peruvian jungle). (Note that the front slash (/) and double
quote (") preceding a vowel as well as the tilde (~) before an n
represent overstrikes that are discussed in section 6.2.2.)
Campa Pajonal segment definition file hab 17-May-85
CLASS V Vowels
a e i o u
aa ee ii oo uu
ae oe
/a /e /i /o /u "u
CLASS C Consonants
c ch g j jy m my n ~n p py qu qy r ry
s sh t th ts ty tz v vy y
CLASS N Word medial nasal consonant clusters
mp nqu nth ntz
mpy nqy nts
nc nt nty
nch
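The file format described above is simple enough to read with a few lines of code. The following Python sketch is an illustration only, not HYPHEN's own reader; the function name parse_segment_file is hypothetical, and rejecting a sequence that appears in two classes is an assumption about how rule 4 might be enforced.

```python
def parse_segment_file(text):
    """Map each segment or sequence to its one-letter class."""
    segments = {}
    current = None                       # None until the first CLASS line
    for line in text.splitlines():
        words = line.split()
        if words and words[0] in ("CLASS", "class") and line.startswith(words[0]):
            if len(words) < 2:
                raise ValueError("CLASS line has no class abbreviation")
            current = words[1][0]        # rest of the line is comment
            continue
        if current is None:
            continue                     # leading comment text
        for seg in words:
            if segments.get(seg, current) != current:
                raise ValueError(f"{seg!r} is in more than one class")
            segments[seg] = current
    return segments


SAMPLE = '''\
Campa Pajonal segment definition file (shortened)
CLASS V Vowels
a e i o u
/a /e /i /o /u "u
CLASS N Word medial nasal consonant clusters
mp nqu nth ntz
'''
print(parse_segment_file(SAMPLE))
```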
6.2.2 Overstrike unit file
This file lists the character sequences that constitute
overstrike units. That is, it lists all sequences that will be
printed as one character as the text is passed through a
Consistent Changes print table. This information is used by
HYPHEN to count correctly where to begin or end hyphenating a
word. The sequences are to be entered in a specified format.
1. The first line is treated as comment.
2. All following text is considered to be a list of the
overstrike units. Each unit should be separated by "white
space" (i.e., a space, a tab, or a new line). Capitals
and lower case letters do not need to be distinguished.
The following shows an example from Spanish.
Overstrike definition file for Spanish 05-Jul-85 hab
'a 'e 'i 'o 'u "u ~n
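To show why the overstrike units matter when counting characters, here is a minimal Python sketch. It is not HYPHEN's code; the function names are hypothetical, and folding everything to lower case is an assumed reading of the remark that capitals and lower case letters need not be distinguished.

```python
def parse_overstrike_file(text):
    """Return the set of overstrike units; the first line is a comment."""
    units = set()
    for line in text.splitlines()[1:]:
        units.update(unit.lower() for unit in line.split())
    return units


def printed_length(word, units):
    """Count printing positions, treating each overstrike unit as one."""
    word = word.lower()
    longest = max((len(u) for u in units), default=1)
    count = pos = 0
    while pos < len(word):
        step = 1                         # an ordinary single character
        for size in range(min(longest, len(word) - pos), 1, -1):
            if word[pos:pos + size] in units:
                step = size              # a whole overstrike unit
                break
        pos += step
        count += 1
    return count


SAMPLE = """Overstrike definition file for Spanish
'a 'e 'i 'o 'u "u ~n
"""
units = parse_overstrike_file(SAMPLE)
print(printed_length("obst'aculo", units))
```

The word obst'aculo is typed with ten characters but prints as nine, so a limit such as "stop 3 characters from the end" must be counted in printed positions.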
6.2.3 Hyphenation change table
This file contains the hyphenation rules. It is to be
written in the form of a "change table," although it is different
from a Consistent Changes table in several ways.
A change table is a list of paired strings, each string
bounded by double quotes ("). The first string of a pair is
called the "match string"; it specifies some pattern to be
matched. The second string, called the "substitution string,"
specifies what is to be substituted for each occurrence of the
matched string. Please note the following when writing a
hyphenation change table:
1. Any character(s) may be placed between the left and right
strings. This allows whatever notation you like to
symbolize the change. The following lines have the same
effect:
"VCV" becomes "V-CV"
"VCV" --> "V-CV"
"VCV" > "V-CV"
"VCV" "V-CV"
2. Anything following the right string is ignored, so
comments may follow the pair of strings.
3. If a character other than space or tab appears on a line
before the first double quote mark, then that line is
regarded as a comment, and any change on that line will
not be applied. This provides a simple mechanism for
disabling a change: simply put some character ahead of the
first string. For example, the following line would not
make any change:
off "VCV" > "V-CV"
4. The hyphenation rules are ordered and will be applied as
many times as possible. That is, the first change in the
table will be made until it cannot be made anymore. Then
the second change will be made and so on. This feature
has great advantages, but can cause problems if not
properly used. It is possible to create an infinite loop
with this table! Consider the following changes, where C
is the class of consonants, V is the class of vowels, and
G is the class of the single segment glottal.
"CCC" > "Cc-C"
"CC" > "C-C"
"VG" > "Vg-"
Note the order of the changes. If the double consonant
change were put first, it would never see a triple
consonant change (CCC would become C-CC and then become
C-C-C). Note that the first change converts the second C
to a lower case c. This is so that after CCC becomes
Cc-C, the second rule will not then convert the CC to C-C.
Also note that this same "trick" was applied for the VG
change. Without it, we would have an infinite loop: VG
would become VG- which then becomes VG--, and so on.
5. The special symbol # indicates a word boundary. Thus
"#CV" indicates word-initial CV and "CV#" indicates
word-final CV.
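The points above can be illustrated with a short Python sketch that reads a change table and applies its rules in order to a class pattern. This is not HYPHEN's implementation. The function names, the use of case-sensitive string replacement on the class pattern, and the pass limit that guards against an infinite loop are all assumptions made for the illustration.

```python
def parse_change_table(text):
    """Return the ordered list of (match, substitution) pairs."""
    rules = []
    for line in text.splitlines():
        first = line.find('"')
        if first < 0 or line[:first].strip():
            continue                     # no string, or disabled by leading text
        parts = line[first:].split('"')
        if len(parts) >= 5:              # '', match, separator, substitution, rest
            rules.append((parts[1], parts[3]))
    return rules


def apply_rules(pattern, rules, max_passes=100):
    """Apply each rule in turn until it no longer changes the pattern."""
    for match, sub in rules:
        for _ in range(max_passes):
            changed = pattern.replace(match, sub)
            if changed == pattern:
                break
            pattern = changed
        else:                            # never settled: probably a loop
            raise RuntimeError(f"rule {match!r} > {sub!r} does not terminate")
    return pattern


TABLE = '''\
Example changes (this first line is a comment)
"CCC" > "Cc-C"
"CC" becomes "C-C"   break between two consonants
off "VV" > "V-V"
"VG" > "Vg-"
'''
rules = parse_change_table(TABLE)
print(rules)
print(apply_rules("#VCCCVGV#", rules))
```

In this sketch the disabled line beginning with "off" is skipped, and a rule such as "VG" > "VG-" is reported as non-terminating instead of looping forever.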
Please note the following special restrictions on the above:
1. There must be a one-to-one correspondence between the
number of non-hyphen characters in the match string and
the substitution string. Thus the following will produce
unpredictable results:
"AI" > "V" (too few char's in sub. string)
"C" > "TR" (too many char's in sub. string)
2. When word boundary conditions are indicated in the match
string, the substitution string should also include the
word boundary symbol (#):
"#VCV" > "#V-CV"
"VCV#" > "VC-V#"
6.2.4 Standard format marker field file
This file allows the user to specify which standard format
fields (in a text containing several fields) are to be hyphenated.
Merely list the format markers that indicate the fields to which
the hyphenation rules are to apply. They can be laid out in any
way. Any text that is not preceded by a backslash character (\) is
considered to be a comment. The following could be an example for
a dictionary:
Pajonl.sfm Campa Pajonal std format marker field file
\w words
\i illustrative sentences
Please note that this file is optional. If no file is specified
when the program is run, all fields will be used.
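A reader for this file can be sketched in a few lines of Python. The sketch is illustrative only; the function names are hypothetical, and treating an empty marker set as "hyphenate all fields" is an assumption that mirrors the behavior described above for a missing file.

```python
def parse_marker_file(text):
    """Return the set of markers; text without a leading backslash is comment."""
    return {word for word in text.split() if len(word) > 1 and word[0] == "\\"}


def field_is_selected(marker, markers):
    """With no marker file (an empty set), every field is hyphenated."""
    return not markers or marker in markers


SAMPLE = ("Pajonl.sfm Campa Pajonal std format marker field file\n"
          "\\w words\n"
          "\\i illustrative sentences\n")
markers = parse_marker_file(SAMPLE)
print(sorted(markers))
print(field_is_selected("\\d", markers))
```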
6.3 Running the program
When HYPHEN is first run, it begins by indicating the amount
of free memory available with a message like:
HYPHENATION Version 1.3 (12-Dec-86)
SETUP-ALLOC-112832 bytes for records
You are then asked to specify which non-alphabetic characters
(i.e., anything other than a-z) are to be treated as parts of words.
Press <RETURN> to include these as alphabetic characters: ~'
Otherwise type the characters desired:
If, for example, you were using ' for accent, ~n for an enyee, and
"u for a dieresis u, you would want to type:
'"~
and then press the <RETURN> key. HYPHEN will then inform you of
the characters it will treat as alphabetic (i.e., are used in
forming a word). Any other characters will be considered to be
punctuation. For example, if you used the example above, the
following will be displayed:
Using the following as alphabetics:
'"~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
It now asks a series of three questions about how to hyphenate.
The first is:
Discretionary hyphen character: [&]
Type the character you wish to use for the discretionary hyphen
and press the <RETURN> key. It will assume that you want to use
an ampersand (&) if you just press the <RETURN> key. Note that
you will also have to inform Manuscripter of the character you use
for your discretionary hyphen symbol (with the .dh command). Note
that a two-character sequence may also be used (e.g., [- as used
by SIL's Printing Arts Department in Dallas for typesetting). The
second question is:
Hyphenation starts after this many characters: [2]
Enter the minimum number of characters in a word that is
acceptable for hyphenation to begin and press the <RETURN> key.
It will assume you want it to begin after at least two characters
if you just hit the <RETURN> key. The third question is:
Hyphenation stops at this many characters from the end: [2]
Enter the number desired and press the <RETURN> key. It will
assume that you want 2 characters if you just hit the <RETURN>
key. Now it asks for the files that you have created as discussed
in section 6.2. The first one is:
Segment definition file:
Enter the name of your file and press the <RETURN> key. Secondly,
it asks:
Overstrike unit file: (<RETURN> for no overstrike units)
If you have a file specifying which character sequences constitute
one printing segment, enter its name and press the <RETURN> key.
If there are no such sequences, merely press the <RETURN> key.
Note that if you have overstrike characters in your text but do not
specify them here, HYPHEN may not correctly delete discretionary
hyphens too near the front or too near the end of a word. Then it
will ask:
Hyphenation change table:
Enter the file name of your change table and press the <RETURN>
key. After it has loaded the file, it will display how many
changes it found. It will now ask:
Standard format marker field file: (<RETURN> for all fields)
If you have a file specifying which standard format fields are to
be hyphenated, enter its name and press the <RETURN> key. If you
want to hyphenate the entire text, merely press the <RETURN> key.
It now asks:
Input file:
Enter the name of the text file you wish to be hyphenated and
press the <RETURN> key. Then it asks:
Output file: [xxxxxx.hyp]
where xxxxxx represents the name given for the input text file.
Enter the name you want for the hyphenated file and press the
<RETURN> key. If you just press the <RETURN> key, HYPHEN will
write your file on the default device using an extension of .hyp.
After it has processed the file, it will display the number of
words it processed and then ask:
Next input file: (<RETURN> if no more)
Enter the name of any additional files to be hyphenated or press
the <RETURN> key.
6.4 Examples
Following are three examples from Peruvian languages. The
explanation of the rules, the hyphenation change table, and the
segment definition file are shown for each.
6.4.1 Spanish
6.4.1.1 Hyphenation rules. These are from The New World
SPANISH-ENGLISH and ENGLISH-SPANISH Dictionary, edited by
Salvatore Ramondino (Signet Books, 1969), pp. 553-554.
Consonants
1. ch, ll, rr count as single letters and are never
separated:
pe-cho o-lla pe-rro
2. Single consonants between vowels go with the second vowel:
ca-be-za pa-re-cer
3. The groups pr, pl, br, bl, fr, fl, tr, dr, cr, cl, gr, gl
go with the following vowel and are never separated:
re-pri-mir co-pla te-cla
4. In other groups of two consonants, whether identical or
different, the consonants are divided between the
preceding and the following vowel:
res-pi-ro hon-ra ac-ción in-no-ble at-las
5. In groups of three consonants, the first two go with the
preceding vowel and the third with the following vowel:
ins-tin-to obs-tá-cu-lo
Exception: the groups listed in 3 above are not separated:
en-tre com-pra tem-plo ins-tru-men-to
Vowels
6. In any combination of two of a, e, or o, the syllable is
divided between the two vowels:
ca-o-ba i-de-a-ción
7. In any combination of two vowels in which one is a, e, or
o and the other is i or u, and there is no accent mark on
the i or u, the vowels form a diphthong and are not
separated:
jo-fai-na vian-da em-bau-car men-guan-te
vi-rrei-na con-tien-da en-deu-dar-se con-sue-lo
co-loi-dal na-cio-nal duo-de-no
If there is an accent mark on the a, e, or o of the group,
the two vowels still form a diphthong and are not
separated:
es-táis es-co-géis cuán-do
If the accent mark falls on the i or u of the group, the
two vowels do not form a diphthong and are separated:
ca-í-da pen-sa-rí-a-mos a-ta-úd re-ú-ne
8. In any combination of i and u, that is, ui or iu, no
division of syllables is made between these two vowels.
This holds whether there is an accent mark or not:
ciu-dad rui-do ca-suís-ti-co
9. In any combination of three vowels in which the first one
is i, u, or ü (more than three do not occur), there is no
division of syllables between any two vowels of the group.
This holds whether there is an accent mark on any of the
vowels or not:
a-pre-ciáis
These rules can be simplified to the following hyphenation
rules and segment definitions.
6.4.1.2 Segment definition file. Table 1 shows the segment
definition file needed for Spanish.
Spanish segment definition file hab/sp 08-Jul-85
This data is from The New World SPANISH-ENGLISH and
ENGLISH-SPANISH Dictionary, ed. by Salvatore Ramondino,
1969, pp. 553-4 (V. Division of Syllables in Spanish).
CLASS C Consonants
b bl br c ch cl cr d dr f fl fr g gl gr h j k
l ll m n ~n p pl pr qu r rr s t tr v x z y
CLASS V Vowels
a e i o u
'a 'e 'i 'o 'u
ai ia ei ie oi io ui iu "u'e
au ua eu ue ou uo "ue "ui "u'i
'ai i'a 'ei i'e 'oi i'o 'ui i'u
'au u'a 'eu u'e 'ou u'o
'iu u'i
i'ai i'ei u'ai u'ei "u'ei
uia ui'a uio ui'o uie ui'e
Table 1 - Spanish segments
6.4.1.3 Overstrike unit file. Table 2 shows the overstrike unit
file needed for Spanish. An accented vowel is preceded by a
single quote ('), a dieresis on a u is indicated by a double quote
("), and an enyee is indicated by a tilde (~n).
Overstrike definition file for Spanish 05-Jul-85 hab
'a 'e 'i 'o 'u "u ~n
Table 2 - Spanish overstrikes
6.4.1.4 Hyphenation change table. Table 3 shows the hyphenation
change table needed for Spanish.
Spanish hyphenation rules hab 17-May-85
"VCV" > "V-CV"
"CCC" > "Cc-C"
"CC" > "C-C"
"VV" > "V-V"
Table 3 - Spanish hyphenation rules
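As a check on how the segment classes and the four changes of Table 3 work together, the following Python sketch runs a few of the dictionary's example words through them. It is an illustration under stated assumptions, not HYPHEN itself: the names are hypothetical, only a subset of the vowel sequences of Table 1 is included, the start and stop limits are ignored, and rule matching is assumed to be case-sensitive replacement on the class pattern.

```python
CONSONANTS = """b bl br c ch cl cr d dr f fl fr g gl gr h j k
l ll m n ~n p pl pr qu r rr s t tr v x z y"""
VOWELS = """a e i o u 'a 'e 'i 'o 'u
ai ia ei ie oi io ui iu au ua eu ue ou uo"""     # subset of Table 1
RULES = [("VCV", "V-CV"), ("CCC", "Cc-C"), ("CC", "C-C"), ("VV", "V-V")]

SEGMENTS = {s: "C" for s in CONSONANTS.split()}
SEGMENTS.update({s: "V" for s in VOWELS.split()})


def segment(word):
    """Split a word into defined segments, longest sequence first."""
    longest = max(map(len, SEGMENTS))
    pieces, pos = [], 0
    while pos < len(word):
        for size in range(min(longest, len(word) - pos), 0, -1):
            if word[pos:pos + size] in SEGMENTS:
                pieces.append(word[pos:pos + size])
                pos += size
                break
        else:
            raise ValueError(f"undefined sequence in {word!r}")
    return pieces


def hyphenate(word):
    pieces = segment(word)
    pattern = "".join(SEGMENTS[p] for p in pieces)
    for match, sub in RULES:             # ordered; each applied until exhausted
        while match in pattern:
            pattern = pattern.replace(match, sub)
    # Map the hyphens in the class pattern back onto the segments.
    remaining = iter(pieces)
    return "".join("-" if sym == "-" else next(remaining) for sym in pattern)


for w in ["cabeza", "reprimir", "instrumento", "obst'aculo", "caoba"]:
    print(hyphenate(w))
```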
6.4.2 Amarakaeri
This is a Peruvian jungle language which belongs to the
Harakmbet language family.
6.4.2.1 Hyphenation rules. Amarakaeri has the following
hyphenation rules (as provided by Bob Tripp):
1. When a sequence of vowel-consonant-vowel occurs, a break
may be made following the first vowel, except when the
consonant is d, g, or y.
ya-ti-huad
When a vowel is followed by a glottal, the break is made
after the glottal.
o'-hua'-po
When a sequence of vowel-consonant-glottal-vowel occurs,
the break is made between the consonant and the glottal.
mo'-en-'uy-ne on'-haudiay-'uya-te
2. A break may be made between two consonants.
arat-but yan-nig-pee'
The digraph hu should not be broken.
hua-hue' pak-hue'
huey-pa jo-nan-hua-hua-hue'
When a glottal occurs between two consonants, the break
should be made after the glottal.
On'-ka'-a-po on'-no-kie'-uy
on'-tia-huay-po
3. A break may be made between two vowels.
o'-e-a-po hua-e'-e-ri
However, the vowel clusters oe, oe, ee, ae, ia, ie, io, io
should not be broken.
no-poe'-dik on'-no-po'-toe-po
tia-huay-hued be-tio-ka'
When a cluster of three vowels occurs, break following the
second vowel.
a'-nig-pei-a'-po mo'-ma-noe-an-hua-hui-ka'-a-po-ne
In any vowel cluster including a glottal, a break may be
made after the glottal.
ij-no-poe-a'-a-po'i hua-e'-e-ri aro'-en
4. Do not hyphenate leaving a single letter at the beginning
or end of a word.
6.4.2.2 Segment definition file. Table 4 shows the segment
definition file needed for Amarakaeri. An underscored vowel is
indicated by a closing brace (}) preceding the vowel.
Amarakaeri segment definition file hab 15-May-85
CLASS C Consonants
b c f h hu j k l m n p q r s t v w x z
CLASS X Exception consonants
d g y
CLASS G Glottal
'
CLASS V Vowels
a e i o u
}a }e }i }o }u
}o}e }e}e }a}e ia }i}e io }i}o
oe
Table 4 - Amarakaeri segments
6.4.2.3 Overstrike unit file. Table 5 shows the overstrike unit
file needed for Amarakaeri. An underscored vowel is indicated by
a closing brace (}) preceding the vowel.
Overstrike definition file for Amarakaeri 06-Jul-85 hab
}a }e }i }o }u
Table 5 - Amarakaeri overstrikes
6.4.2.4 Hyphenation change table. Table 6 shows the hyphenation
change table needed for Amarakaeri.
Amarakaeri Hyphenation Change Table hab 15-May-85
"VCV" > "V-CV"
"VCGV" > "VC-GV"
"VXGV" > "VX-GV"
"CC" > "C-C"
"XC" > "X-C"
"CX" > "C-X"
"XX" > "X-X"
"CGC" > "CG-C"
"CGX" > "CG-X"
"XGC" > "XG-C"
"XGX" > "XG-X"
"VGV" > "Vg-V"
"VG" > "Vg-"
"VVV" > "Vv-V"
"VVGV" > "Vvg-V"
"VV" > "V-V"
Table 6 - Amarakaeri hyphenation rules
6.4.3 Campa Pajonal
Campa Pajonal is a Peruvian jungle language which belongs to
the Arawakan language family.
6.4.3.1 Hyphenation rules. These rules were provided by Allene
Heitzman.
1. The vowels are: a, e, i, o; length, written as a geminate
vowel; and the vowel clusters ae and oe.
2. The consonants are: c, ch, g, j, jy, m, my, n, ñ, p, py,
qu, qy, r, ry, s, sh, t, th, ts, ty, tz, v, vy, y.
3. The consonant clusters are (word medial only): mp, mpy,
nc, nch, nqu, nqy, nt, nth, nts, nty, ntz.
4. Break after any vowel preceding a consonant except before
an m or n in a consonant cluster.
5. Do not break off less than four letters.
6.4.3.2 Segment definition file. Table 7 shows the segment
definition file needed for Campa Pajonal.
Campa Pajonal segment definition file hab 17-May-85
CLASS V Vowels
a e i o
aa ee ii oo
ae oe
/a /e /i /o
CLASS C Consonants
c ch g j jy m my n ~n p py qu qy r ry
s sh t th ts ty tz v vy y
CLASS N Word medial nasal consonant clusters
mp nqu nth ntz
mpy nqy nts
nc nt nty
nch
Table 7 - Campa Pajonal segments
6.4.3.3 Overstrike unit file. Table 8 shows the overstrike unit
file needed for Campa Pajonal. An accented vowel is preceded by a
single slash (/), and an enyee is indicated by a tilde (~n).
Overstrike definition file for Campa Pajonal 06-Jul-85 hab
/a /e /i /o ~n
Table 8 - Campa Pajonal overstrikes
6.4.3.4 Hyphenation change table. Table 9 shows the hyphenation
change table needed for Campa Pajonal.
Campa Pajonal hyphenation rules hab 17-May-85
"VC" > "V-C" c break after any vowel preceding
a consonant
c do not break if it is an m or n
in a consonant cluster
Table 9 - Campa Pajonal hyphenation rules
6.5 Miscellaneous
6.5.1 Program limitations
While HYPHEN is quite general, it does have some limitations.
1. If a text has a mixture of vernacular and loan words,
HYPHEN will try to hyphenate the loan words according to
the rules of the vernacular. If the loan word contains
some undefined sequence, then HYPHEN will ring the
terminal bell and display an error message for the word
and will not hyphenate it. (This is actually a
fundamental problem of identifying loan words within a
text).
2. In version 1.2, HYPHEN correctly handles a text containing
Manuscripter bar commands (such as |b or |u). Earlier
versions used to treat the b or u as a part of the word to
be hyphenated and it would lose any capitalization of a
word preceded by a bar command.
3. HYPHEN assumes that the orthography consists only of
lowercase alphabetics. Thus it is not able to tell the
difference between upper and lower case letters, even if,
say capital letters were used to represent unvoiced
vowels. Both will be treated as if they were lower case.
In order for HYPHEN to correctly handle this situation,
one will need to represent the unvoiced sound by some
other unique sequence.
6.5.2 Testing method
The following is a method one can use to test one's segment
definition file and hyphenation change table.
1. Create a file that consists of the example words listed in
the hyphenation rules. Put each word on a separate line.
2. Then make two copies of each word, each one on a separate
line.
3. Place a backslash character in front of the first
occurrence and insert hyphens where they should go.
HYPHEN will then treat this as a standard format marker
and not as a word.
4. Insert a space in front of the second word.
5. Run the file through the HYPHEN program and examine the
results. If hyphenation has occurred correctly, the two
occurrences of the word will line up exactly.
Here is an example of part of such a test file for Spanish.
\o-lla
olla
\ca-be-za
cabeza
\re-pri-mir
reprimir
\co-pla
copla
\te-cla
tecla
\res-pi-ro
respiro
\obs-t'a-cu-lo
obst'aculo
The output would then look like this:
\o-lla
o-lla
\ca-be-za
ca-be-za
\re-pri-mir
re-pri-mir
\co-pla
co-pla
\te-cla
te-cla
\res-pi-ro
res-pi-ro
\obs-t'a-cu-lo
obs-t'a-cu-lo
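For a long test file, the comparison can be done mechanically instead of by eye. The following Python sketch is one way to do it and is not part of the HYPHEN package; the function name is hypothetical, and it assumes the discretionary hyphen character was set to the same character (-) used in the hand-hyphenated lines.

```python
def check_test_output(text):
    """Return the (expected, actual) pairs that fail to line up."""
    failures = []
    expected = None
    for line in text.splitlines():
        if line.startswith("\\"):
            expected = line[1:]          # hand-hyphenated form
        elif expected is not None and line.strip():
            actual = line.strip()        # form produced by the program
            if actual != expected:
                failures.append((expected, actual))
            expected = None
    return failures


OUTPUT = "\\o-lla\n o-lla\n\\ca-be-za\n cabe-za\n\\co-pla\n co-pla\n"
for expected, actual in check_test_output(OUTPUT):
    print(f"expected {expected}, got {actual}")
```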
6.5.3 Some change table techniques
One can use the fact that the hyphenation rules are ordered
to one's advantage. Consider an example from Ticuna, a Peruvian
jungle language. The sequence arj needs to be hyphenated as -arj
word finally and a-rj elsewhere (j is a vowel). The segment
definition file includes the following classes:
TIPHYP.SEG Character classes for Ticuna (Peru) hyphenation
CLASS V
e i o u
CLASS C
b c ch d f g l m n ~n ng p q s t w y
CLASS A
a
CLASS J
j
CLASS R
r
Notice that a, r, and j are in separate classes by themselves.
The hyphenation rules include the following changes:
TIPHYP.CHG changes for hyphenation of Ticuna (Peru)
"ARJ#" > "-arj#"
"A" > "V"
"J" > "V"
"R" > "C"
"VCV" > "V-CV"
Note here that the word final exception is treated first. If the
sequence arj is not word final, then the second through fourth
changes will convert the "ARJ" class sequence into a "VCV" class
sequence. This allows the final change to make the correct
hyphenation.
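The effect of the ordering can be traced on class patterns with a small Python sketch. It is illustrative only: the function name is hypothetical, the two input patterns are made-up word shapes rather than attested Ticuna words, and case-sensitive string replacement is an assumed stand-in for HYPHEN's matching.

```python
RULES = [("ARJ#", "-arj#"),   # word-final exception, handled first
         ("A", "V"),
         ("J", "V"),
         ("R", "C"),
         ("VCV", "V-CV")]


def apply_rules(pattern, rules=RULES):
    """Apply each change in order, repeating it until it no longer matches."""
    for match, sub in rules:
        while match in pattern:
            pattern = pattern.replace(match, sub)
    return pattern


print(apply_rules("#CARJ#"))     # arj at the end of the word
print(apply_rules("#CARJCV#"))   # arj elsewhere in the word
```

In the first pattern the lower-case "arj" left by the first change is no longer visible to the later changes, so the break stays in front of it. In the second, A, R, and J are rewritten as V, C, and V, and the general VCV change then applies.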
7. DELIMITER CHECKING AND NESTING CHECK (DELIM)
7.1 Introduction
Delimiters are symbols used in pairs to enclose specific
information. The most common delimiter pair is parentheses;
others are square brackets and curly braces. DELIM tests whether
delimiters are paired and properly nested. The user may specify
the delimiters to be checked; for example, he may wish to check
the following:
( ) { } " " ` ' [ ] < >
DELIM reports the errors in such a way that they are easy to find.
Multiple files may be checked. DELIM never changes the file that
is being checked.
DELIM is useful for the preparation of any text which makes
use of delimiters. For example, many linguistic papers have
frequent parentheses, phonetic and phonemic bracketing ([] and
//), and glosses (`') all of which must be balanced and properly
nested, for example, [atox] /atuq/ `fox'. Sometimes formatting
programs (e.g., SCRIBE) and often programming languages (e.g.,
PTP, C) require heavy use of delimiters. (While errors in these
can sometimes be discovered by running the program, it will
generally be much quicker to discover the errors with DELIM and
correct them before running the program.)
7.2 Running the program
DELIM begins to run with the following message:
DELIMITER PAIRING AND NESTING CHECK Version 2.1 (12-Dec-86)
Press <RETURN> to use these delimiters:
({["
)}]"
Otherwise type delimiter file name:
If you are satisfied with this list of delimiters, simply type a
carriage return. Otherwise specify the name of the delimiter file
that includes the delimiters you want to check. The form of such
a file is discussed in section 7.4. Next you will be asked for an
output file:
Output file: [con]
If you simply type a carriage return, the output will be put to
the terminal. If you wish to have the output printed directly
(i.e., without first creating a file on some device), respond with
prn (or however you refer to your printer). If you type a file
name, the result will be written to that file. Next, by means of
the prompt
Input file:
you are asked for the file to be checked. Respond with the
appropriate file name. When DELIM finishes checking the first
file, it asks for another file to be checked:
Next input file: (<RETURN> if no more)
When there are no more files to be checked, simply type a carriage
return to return to the monitor.
7.3 The form of the output
The output file will contain, for each file being checked,
its name, the potential errors found in that file, and the number
of potential errors found in that file.
There are two sorts of errors. First, there might be a right
delimiter for which there was no previous corresponding left
delimiter. For example, if a file started with the line
This is a file ] which has an error.
the error would be reported as follows:
unmatched right ] on line 1
This is a file ] which has an error.
^
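Second, there might be a left delimiter for which there is no
subsequent corresponding right delimiter. For example, if a 15
line file ended with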
This is a file { which has an error.
the error would be reported as follows:
unmatched left { on line 15
This is a file {
^
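The kind of check that produces these two reports can be sketched with a stack of open delimiters. The Python below is an illustration of the general technique, not DELIM's source. The function name and the returned tuples are hypothetical, and the treatment of a character such as " that serves as both left and right delimiter (it closes if it is the innermost open delimiter, otherwise it opens) is an assumption.

```python
def check_delimiters(lines, lefts='({["', rights=')}]"'):
    """Return a list of (message, line number, column) tuples."""
    close_for = dict(zip(lefts, rights))
    errors = []
    stack = []                           # open delimiters: (char, line, column)
    for lineno, line in enumerate(lines, 1):
        for col, ch in enumerate(line):
            if ch in rights and stack and close_for[stack[-1][0]] == ch:
                stack.pop()              # closes the innermost open delimiter
            elif ch in close_for:
                stack.append((ch, lineno, col))
            elif ch in rights:
                errors.append((f"unmatched right {ch}", lineno, col))
    # Anything still open at the end of the file was never closed.
    errors.extend((f"unmatched left {ch}", n, c) for ch, n, c in stack)
    return errors


for message, lineno, col in check_delimiters(
        ["This is a file ] which has an error."]):
    print(f"{message} on line {lineno}")
    print(" " * col + "^")
```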
7.4 How to write a delimiter file
To specify delimiters other than the defaults, it is
necessary to create a delimiter file. This file contains two or
three lines. The optional first line is reserved for comments
such as "Delim file for XYZ." The second line should list
(without intervening spaces, commas, etc.) all the left
delimiters. The third line should list the corresponding right
delimiters, with each right delimiter directly below the
corresponding left delimiter. For example, the following is an
acceptable delimiter file (where there is nothing on lines two and
three other than the delimiters, and all lines end with a carriage
return):
This is a DELIM file for XYZ
[{(
]})
Any character can be given as a delimiter, but note, a
delimiter can only be a single character.
If the last two lines of the delimiter file are not the same
length, you will be informed with the following message when the
program runs:
Delimiter lists are not the same length.
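Reading such a file and making the length check can be sketched as follows. This Python fragment is illustrative and is not DELIM's code; the function name is hypothetical, and taking the last two lines as the delimiter lists (so that the comment line may be present or absent) is an assumption.

```python
def parse_delimiter_file(text):
    """Return (left delimiters, right delimiters) from a delimiter file."""
    lines = text.splitlines()
    if len(lines) not in (2, 3):
        raise ValueError("a delimiter file has two or three lines")
    lefts, rights = lines[-2], lines[-1]     # comment line, if any, is first
    if len(lefts) != len(rights):
        raise ValueError("Delimiter lists are not the same length.")
    return lefts, rights


print(parse_delimiter_file("This is a DELIM file for XYZ\n[{(\n]})\n"))
```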
It is possible (and sometimes desirable) to give the
delimiters to be checked directly from the terminal. This can be
done by giving the terminal device name (tt: for RT-11, con for
MS-DOS) in response to the prompt for a delimiter file name, typing
the two lines of left and right delimiters, and then closing the
file with a ^Z (control Z). For example, if one wished to check
only the delimiter pairs ( ) and [ ], he could respond to the
prompt for a delimiter file with tt: (on RT-11 systems), then type
the sequence:
( [ <RETURN> ) ] <RETURN> ^Z
7.5 Program limitations
One actual error sometimes causes DELIM to report many errors
(i.e., errors are said to "cascade"). Thus, sometimes error
messages subsequent to a real error should simply be disregarded.
If the real error is fixed, the subsequent (erroneous) error
messages go away.
Too many unmatched left delimiters (more than approximately
15) will cause DELIM to terminate with a message beginning "Stack
overflow..." If this happens, control is returned to the monitor.
Try checking the file with fewer delimiter pairs or correct what
errors you can and rerun the program.
Delimiters cannot span files, that is, corresponding
delimiters must be in the same file. DELIM does not ignore
delimiters in comments or in quoted strings. DELIM can check only
99 pairs of delimiters.